Re: Python 2.7 and parallel testing

On 06/27/2012 05:23 AM, Graham Binns wrote:
> Hi Gary,
> 
> Sorry that my postprandial email last night was a bit of a braindump.
> Here's some clarity to go with it.

:-) Great, thank you.

> On 26 June 2012 20:40, Gary Poster <gary.poster@xxxxxxxxxxxxx> wrote:
>>> To work out
>>> what we actually needed to do, we ran the test suite on Precise.
>>> Martin ran it linearly in Canonistack and I updated our buildbot-slave
>>> charm to tell lpsetup to use precise.
>>
>> (Gary muttering to himself:) right, you told it to use Precise in the
>> containers.  The slave host was already precise.
> 
> Right. My change to the buildbot-slave charm was to add "-r precise" to
> the args passed to lpsetup in lpbuildbot.yaml.

OK, cool, simple enough to replicate.

It's also worth noting that we are effectively making two changes at
once here: Precise *and* Python 2.7.  I suspect that any annoyances
specific to parallel testing come more from the OS change than from the
Python change, but we can always try to isolate that later if we need to.

>>> The Parallel testing results were interesting to say the least.
>>> Everything sets up correctly, and it was perfectly happy for me to
>>> start a build. However, the build ran for ~40 minutes and then died
>>
>> When you say "the build ran" what do you mean? When looking at the
>> buildbot master, was there anything on the stdout log of the test step?
>>  If so, what was it?  Did you also look at the build log (the log of the
>> previous step) to make sure it looked ok?
> 
> What I mean is that all the steps up to and including the build step ran
> fine. The test step also ran, but there was no output other than the
> messages about the LXC containers' IPs being added to known_hosts,
> followed by a bunch of LXC shutdown messages.

(Note that I am suspicious that the build step can return a 0 exit code
even when it fails, so always double-check the logs.  If you do see a
failed build with a 0 exit code--a green step--then please file a
bug/make a card.  It sounds like you checked the log, though, so I'll
proceed on the assumption that it really was OK.)

OK.  Given that the build step (which starts an LXC container and goes
all the way through make schema) ran fine, we have two likely culprits:
either the ephemeral LXC containers trigger a problem somehow, or the
test runner on Precise has a problem inside any LXC container.

I suggest shutting down the master so that it doesn't bother you, and
then messing with the slave.

First, preparation.

 * Make sure all containers are off.
 * On Precise, we are not supposed to have to make any changes (like,
say, for bug 1014916), but we haven't been testing Precise multiple
times a day, so there may be various problems on the LXC side that we
haven't encountered yet.
 * You could modify /usr/local/bin/lp-setup-lxc-test so that it passes
-vvv to bin/test, but that's just window dressing and largely unnecessary.
 * If you are not testing on a 32-core machine, you will want to modify
or remove the --concurrency option in master.cfg, as we discussed on IRC.
 * For convenience, I suggest getting yourself a root password in the
container.  You might be able to do this with the SSH keys that buildbot
throws around, but I don't think so, and I always follow this recipe
(rough commands below).
   * On the slave, set the password (passwd) for root.
   * Look in /etc/shadow for the hashed password and copy that line.
   * Modify /var/lib/lxc/lptests/rootfs/etc/shadow to replace the
existing root hash (empty) with the one you copied from the host.
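
In command form, that recipe looks roughly like this (off the top of my
head and untested, so adjust as needed):

  # On the slave host, as root:
  passwd root                      # set a root password on the host
  grep '^root:' /etc/shadow        # copy the hash field from this line
  # Paste that hash into the (currently empty) second field of the
  # root: entry in the container's shadow file:
  vi /var/lib/lxc/lptests/rootfs/etc/shadow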

Second, testing.

 * I'd try running the tests with --subunit in a non-ephemeral container
first (rough commands after this list).  Run lxc-start -n lptests and
log in as root.  Switch to the buildbot user (and start bash, because sh
is miserable).  Go over to ~/slaves/slave/lucid_devel/build.  Run
bin/test --layer=UnitTest --subunit, or something like that, to see
whether things seem copacetic.  If so, shut down.
 * Next, I'd try something similar with an ephemeral container (make
sure the lptests container has been shut down first!).  Run
lxc-start-ephemeral -o lptests -d, figure out the name of the ephemeral
container with lxc-list, and then, as root, run ssh `lp-lxc-ip -i eth0
-n NAME_OF_EPHEMERAL_CONTAINER`.  Then run the tests as before and see
whether that works.
 * If you still don't have a lead, I'd power off the ephemeral container
and mess with testr a bit.
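
As a sketch, the two container experiments above boil down to something
like this (from memory and untested, so treat it as a starting point
rather than gospel):

  # 1. Non-ephemeral run, as root on the slave host:
  lxc-start -n lptests
  # ...log in on the console as root, then:
  su - buildbot
  bash
  cd ~/slaves/slave/lucid_devel/build
  bin/test --layer=UnitTest --subunit
  # shut the container down when finished

  # 2. Ephemeral run (lptests must be stopped first), as root on the host:
  lxc-start-ephemeral -o lptests -d
  lxc-list                         # note the ephemeral container's name
  ssh `lp-lxc-ip -i eth0 -n NAME_OF_EPHEMERAL_CONTAINER`
  # ...then run the tests inside it exactly as in step 1.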

>>> with a Twisted timeout error. It's as though the workers were
>>> semi-communicating with the master but just not reporting any tests.
>>
>> It would help to agree on terminology.  "workers" are what the buildbot
>> web interface currently calls ephemeral LXC containers.  They report to
>> a central testr process running in the LXC host.  This testr process in
>> turn is controlled by, and reports to, the buildbot slave process, also
>> running in the LXC host.  The buildbot slave process reports to the
>> buildbot master, running on the other juju machine.
>>
>> With that terminology, what was it you saw?
> 
> I have nothing concrete here, only an hypothesis based on what I could
> see (or not, as the case was).
> 
> The ephemeral LXC containers appear to have started. There were some
> complaints in the buildbot test log about them refusing connections; can
> we assume that was some kind of race condition between them starting and
> being available to SSH? 

If this were on a non-32-core machine, that would be expected and fine.

If it's on a 32-core machine, we have a problem similar to, yet distinct
from, bug 1014916, and we should note it down as a problem to be
investigated and solved.  A low-priority card on our kanban board would
be good as far as I am concerned.

> Anyway, that problem only occurred with a couple
> of them, and we had twenty-something new IPs added to known_hosts.
> 
> To clarify my statement: it looks as though the testr process was
> running but not reporting back to the slave process, or the slave
> process wasn't reporting to the master. I think that of the two the
> former is more likely, for two reasons:
> 
>  1) We did see output from the slave - the test log - and all the other
>     logs - have stuff in them that's meaningful.
>  2) When previously we've had a run where no tests have actually been
>     executed the run has finished almost as soon as it's started. This
>     run took around 40 minutes before the Twisted timeout error showed
>     up. I think it's safe to assume that _something_'s happening; I just
>     don't know what.

I understand now, thank you.

> 
>>> Does anyone have any ideas as to why this might be?
>>
>> Not yet. :-) If you give me a branch or directions or something I'll be
>> happy to poke at it.  Alternatively, you could poke yourself, but I'll
>> hold off on giving poking ideas until I hear more about the symptoms.
> 
> I'm going to run things again this morning - once I've dealt with a
> persistent install error in the slave charm (wasn't happening
> yesterday). 

I'm getting that too (running the usual Lucid setup), and will
investigate soon.  Lemme know if you already have an answer!

> I'll try poking around in the containers when the tests are
> running.
> 
> As for a branch to play with, I was using devel yesterday; I see no
> reason to try anything else yet. Simply updating the slave config to add
> "-r precise" to the lpsetup args should be sufficient, unless I'm
> misunderstanding.

Cool, thanks.

> 
>>> We might not need to
>>> worry too much about it just yet, but once production is running on
>>> Precise our LXC containers will need to as well, and it will become an
>>> issue then.
>>
>> I'll talk with Francis today about this.  I would personally hope that
>> Precise would not be usable in Production until it could work in
>> parallel tests.  One way or another, it is a big deal for us collectively.
> 
> Agreed. This is a bit of a race condition, but as jam observed earlier,
> it's more likely that we'll have parallel testing done first than Python
> 2.7 compatibility; I guess the position that logically follows is for us
> to test against Python 2.7 in a parallel environment on a regular basis.

As threatened, I spoke to Francis about this.  His position was that,
while you and the Blue squad are working together, you ought to take
advantage of that opportunity to investigate, and we ought to support
you as you request.  Once you've returned from the sprint, we should
focus on Lucid and lpsetup for now, and add Precise later as a discrete
step.

OTOH, if you do happen to get everything working by the end of the week,
I will be inclined to set up regular tests so that position doesn't
bit-rot too much without an early warning.

> I'll let you know how things go with the poking around.

Thank you!

Gary



