yellow team mailing list archive

Re: Python 2.7 and parallel testing

To: Gary Poster <gary.poster@xxxxxxxxxxxxx>
From: Graham Binns <graham@xxxxxxxxxxxxx>
Date: Wed, 27 Jun 2012 10:23:09 +0100
Cc: Launchpad Yellow Squad <yellow@xxxxxxxxxxxxxxxxxxx>, Jelmer Vernooij <jelmer@xxxxxxxxxxxxx>, Vincent Ladeuil <vincent.ladeuil@xxxxxxxxxxxxx>, John Meinel <john.meinel@xxxxxxxxxxxxx>, Martin Packman <martin.packman@xxxxxxxxxxxxx>
In-reply-to: <4FEA1037.5010008@canonical.com>
Sender: graham.binns@xxxxxxxxx

Hi Gary,

Sorry that my postprandial email last night was a bit of a braindump.
Here's some clarity to go with it.

On 26 June 2012 20:40, Gary Poster <gary.poster@xxxxxxxxxxxxx> wrote:
>> To work out
>> what we actually needed to do, we ran the test suite on Precise.
>> Martin ran it linearly in Canonistack and I updated our buildbot-slave
>> charm to tell lpsetup to use precise.
>
> (Gary muttering to himself:) right, you told it to use Precise in the
> containers.  The slave host was already precise.

Right. My change to the buildbot-slave charm was to add "-r precise" to
the args passed to lpsetup in lpbuildbot.yaml.

>> The Parallel testing results were interesting to say the least.
>> Everything sets up correctly, and it was perfectly happy for me to
>> start a build. However, the build ran for ~40 minutes and then died
>
> When you say "the build ran" what do you mean? When looking at the
> buildbot master, was there anything on the stdout log of the test step?
>  If so, what was it?  Did you also look at the build log (the log of the
> previous step) to make sure it looked ok?

What I mean is that all the steps up to and including the build step ran
fine. The test step also ran, but there was no output other than the
messages about the LXC containers' IPs being added to known_hosts,
followed by a bunch of LXC shutdown messages.

>> with a Twisted timeout error. It's as though the workers were
>> semi-communicating with the master but just not reporting any tests.
>
> It would help to agree on terminology.  "workers" are what the buildbot
> web interface currently calls ephemeral LXC containers.  They report to
> a central testr process running in the LXC host.  This testr process in
> turn is controlled by, and reports to, the buildbot slave process, also
> running in the LXC host.  The buildbot slave process reports to the
> buildbot master, running on the other juju machine.
>
> With that terminology, what was it you saw?

I have nothing concrete here, only a hypothesis based on what I could
see (or couldn't, as the case may be).

The ephemeral LXC containers appear to have started. There were some
complaints in the buildbot test log about them refusing connections; can
we assume that was a race between them starting and becoming reachable
over SSH? Anyway, that problem only occurred with a couple of them, and
we had twenty-something new IPs added to known_hosts.
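
If the refused connections really are a start-up race, one way to check
would be to poll each container's SSH port before trying to use it. The
following is only a sketch in modern Python 3 (not the Python 2.7 under
discussion); wait_for_port and its parameters are hypothetical names and
are not part of lpsetup, testr or buildbot.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection.

    Returns False if no connection succeeds before `timeout` seconds
    have elapsed. This only shows that something is listening on the
    port; it does not prove that an SSH login would succeed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is itself bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, unreachable or timed out: retry
            # until the overall deadline has passed.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port(ip) for each container IP before the first SSH
command would show whether the refusals go away after a short delay,
which would support the race explanation.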

To clarify my statement: it looks as though the testr process was
running but not reporting back to the slave process, or the slave
process wasn't reporting to the master. I think that of the two the
former is more likely, for two reasons:

 1) We did see output from the slave: the test log and all the other
    logs have meaningful content in them.
 2) When we've previously had a run where no tests were actually
    executed, the run finished almost as soon as it started. This run
    took around 40 minutes before the Twisted timeout error showed up.
    I think it's safe to assume that _something_ is happening; I just
    don't know what.

>> Does anyone have any ideas as to why this might be?
>
> Not yet. :-) If you give me a branch or directions or something I'll be
> happy to poke at it.  Alternatively, you could poke yourself, but I'll
> hold off on giving poking ideas until I hear more about the symptoms.

I'm going to run things again this morning, once I've dealt with a
persistent install error in the slave charm (which wasn't happening
yesterday). I'll try poking around in the containers while the tests
are running.

As for a branch to play with, I was using devel yesterday; I see no
reason to try anything else yet. Simply updating the slave config to add
"-r precise" to the lpsetup args should be sufficient, unless I'm
misunderstanding.
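
To make the change concrete, here is a small sketch of the edit applied
to an argument string. It assumes the lpsetup args are held as a single
shell-style string; with_release is a hypothetical helper, and any flag
other than "-r" in the usage example is made up for illustration.

```python
import shlex


def with_release(args, release="precise"):
    """Return the argument string with '-r <release>' set.

    If '-r' is already present its value is replaced, otherwise the
    flag is appended. Only the '-r' flag comes from this thread.
    """
    tokens = shlex.split(args)
    if "-r" in tokens:
        idx = tokens.index("-r")
        if idx + 1 < len(tokens):
            tokens[idx + 1] = release
        else:
            # A trailing '-r' with no value: supply the value.
            tokens.append(release)
    else:
        tokens += ["-r", release]
    return " ".join(shlex.quote(t) for t in tokens)
```

For example, with_release("-u someuser") gives "-u someuser -r precise",
where "-u someuser" stands in for whatever arguments the config already
passes.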

>> We might not need
>> worry too much about it just yet but once production is running on
>> Precise our LXC containers will need to as well, and it will become an
>> issue then.
>
> I'll talk with Francis today about this.  I would personally hope that
> Precise would not be usable in Production until it could work in
> parallel tests.  One way or another, it is a big deal for us collectively.

Agreed. This is a bit of a race condition, but as jam observed earlier,
it's more likely that we'll have parallel testing done first than Python
2.7 compatibility; I guess the position that logically follows is for us
to test against Python 2.7 in a parallel environment on a regular basis.

I'll let you know how things go with the poking around.

-- 
Graham Binns | PGP Key: EC66FA7D
http://launchpad.net/~gmb

Follow ups

Re: Python 2.7 and parallel testing
From: Gary Poster, 2012-06-27

References

Python 2.7 and parallel testing
From: Graham Binns, 2012-06-26
Re: Python 2.7 and parallel testing
From: Gary Poster, 2012-06-26