Re: parallel testing LEP questions

 

On Thu, Nov 24, 2011 at 6:26 AM, Gary Poster <gary.poster@xxxxxxxxxxxxx> wrote:
> Hey Robert.  Francis mentioned that you had updated the parallel testing LEP so I took a moment to look at it today.
>
> I cc'd the yellow squad to keep us all in the loop.  Hi everybody!  The LEP is https://dev.launchpad.net/LEP/ParallelTesting if you want to take a look.
>
> Could you clarify these points, ideally on the LEP?
>
> - You write that we must "[o]rganise and upgrade our CI test running to take advantage of this new environment."  You also clarify that "[c]hanging the landing technology is out of scope."  To make sure I understand, then, you want us to keep buildbot and everything else as-is as much as possible, but guide LOSAs to getting us machines/VMs that can quickly and robustly run these tests.  Is this right?  If so, no additional LEP clarification needed, I think, but otherwise, please give us more information there.

Yes. I am keen to overhaul our CI story as well, but that seems
entirely orthogonal to me. Francis and I would be very open to an
argument that this is misguided :). So 'Yes, that's right'.

> - You write in comments that "The prototype LXC + testr based parallelisation seems to have the best effort-reward tradeoff today."  [Yellow folks, I found https://dev.launchpad.net/ParallelTests to describe the prototype.]  Have you done enough research here that you are able to recommend or even prescribe this approach?  That would probably save time, if so; and though it violates my understanding of LEP goals to have an implementation prescribed, I think that ought to be relaxed for documents written by the TA.

I think prescribing it would make sense. Basically we have a tonne of
same-machine shared state still lurking in our test suite that would
make a shared-machine parallel environment (e.g. splitting by threads
or processes) a high-risk endeavour - we could spend months finding
and fixing such things as they bite us (and as race conditions they
would be a source of continual pain rather than a clear 'wow, that's
broken, let's fix it' scenario). For instance, one such thing is the
oops directory for tests, /var/tmp/lperr.test, which even now isn't
quite gone or unique-per-test-run. One very solid way to mitigate the
risk of such race conditions is to use separate machines, but separate
machines bring coordination and setup overhead: syncing code around,
creating template dbs - none of these things are free. We demonstrated
parallel testing using subunit on multiple machines years ago and it
was very effective - and more recently Aaron has written a canonistack
helper that runs tests on parallel machines. LXC offers a way to make
very efficient use of one machine, with the benefits of separate
machines and without [most] of the overhead of separate machines or
VMs.
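
To make that concrete, here's the shape of fix each bit of shared
state needs - give every run its own directory rather than one fixed
path. This is illustrative only; the OOPS_DIR hook and the prefix are
made up, not Launchpad's actual config:

import os
import tempfile

# Give each test run a private oops directory instead of sharing
# /var/tmp/lperr.test, so concurrent runs cannot race on one path.
oops_dir = tempfile.mkdtemp(prefix='lperr.test.', dir='/var/tmp')
os.environ['OOPS_DIR'] = oops_dir  # hypothetical hook read by test config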

> - If we use LXC, do you expect this effort to dig into the fragility that you note in your prototype notes, and try to improve it?  If not, do you have requirements or thoughts on how to help developers work with the issues--perhaps scripts that developers are encouraged to use for the workflow, that handle problems like the ones you identify ("you may need to manually shutdown postgresql before stopping lxc, to get it to shutdown cleanly")?

Serge Hallyn and the Ubuntu server team are driving LXC to be a local
cloud deployment environment - it's a key feature goal for Precise. I
expect that we can benefit from this work (even without running
Precise on the server/VM that hosts the LXC instance - though we can
do that if needed). There will be things we need to automate etc. I
expect the squad to run into some curly problems they need to escalate
to the Ubuntu server team for assistance, but most of those have
already been identified, with workarounds sketched or implemented, in
the LXC experiments wgrant and I were doing.
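
As an example of the sort of workaround I mean: the prototype notes
say you may need to stop postgresql in the guest before lxc-stop to
get a clean shutdown, and a small wrapper script can hide that from
developers. A sketch only - the container name and ssh access into the
guest are assumptions:

import subprocess

CONTAINER = 'lptests'  # hypothetical container name

def stop_container():
    # Stop postgresql inside the guest first, then stop the container,
    # working around the unclean-shutdown fragility noted above.
    subprocess.check_call(
        ['ssh', CONTAINER, 'sudo', '/etc/init.d/postgresql', 'stop'])
    subprocess.check_call(['sudo', 'lxc-stop', '-n', CONTAINER])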

> - If we use LXC, you describe a number of steps to set up a working environment.  Do you envision a rocketfuel-XXX style script to help produce this environment?  If so, do you have any requirements for it?  If not, do you have something else in mind, and can we extract requirements from that?

There are two components here - the installation of dependencies and
setup of the libvirt/LXC host configuration, and the creation of the
LXC template. Whether the first component is manual or automated
doesn't really bother me - I'm happy with automation, but OTOH it's a
one-time cost to get the LOSAs to set up a working environment on e.g.
the buildbot slave.
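
If we do automate it, the script is only a handful of steps anyway -
something like the sketch below, though the package list is my guess
rather than a final set:

import subprocess

# One-time host setup: install the LXC and libvirt dependencies.
# 'lxc' and 'libvirt-bin' are an assumed dependency set.
subprocess.check_call(
    ['sudo', 'apt-get', 'install', '-y', 'lxc', 'libvirt-bin'])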

However, I think it's important that we be able to rebuild an LXC
template rapidly (or even in the test run itself), as we *may* find
that that is the most reliable way to ensure everything is just so.
LXC environments can persist - certainly the template environment for
the tests can persist, and there are some latency benefits in having
that - but a totally clean environment is also very beneficial. The
LXC guest itself is only a couple of hundred MB, so it is pretty fast
to bootstrap once LXC has cached all the bits.
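
Rebuilding is cheap enough that we could do it per run if we want to.
A sketch, with the container name and the 'ubuntu' lxc template as
assumptions about our setup:

import subprocess

TEMPLATE = 'lp-template'  # hypothetical template container name

# Throw away any stale template and recreate it from scratch; once LXC
# has cached the bootstrap bits this is pretty fast.
subprocess.call(['sudo', 'lxc-destroy', '-n', TEMPLATE])
subprocess.check_call(['sudo', 'lxc-create', '-n', TEMPLATE, '-t', 'ubuntu'])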

> If you don't intend to recommend/prescribe LXC + testr, these next two questions are pertinent.
>
> - You write that the solution "[m]ust parallelise more effectively than bin/test -j (which does per-layer splits)."  Is that really a "must"? If we met your success metric ("down to less than 50% of the current time, preferrable 15%-20%"), would it really matter which method got there?  If it does matter, can you identify what the underlying "must" is for rejecting the -j approach, so that, for instance, other solutions can be cleanly rejected?

Our test distribution per layer is not very even - I highly doubt that
we'd be able to reach 15% of the current time splitting per layer. The
other issue, the shared global state that will bite us, will also be a
significant problem with -j, unless a remoting facility is brought in
(and at that point it seems to be reinventing subunit.... :P).
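
A back-of-the-envelope way to see it: with per-layer splits the run
can never finish faster than its largest layer. The layer names and
times below are made up purely for illustration:

# Wall-clock time with one worker per layer is bounded below by the
# slowest layer, no matter how many workers you add.
layer_minutes = {'AppServerLayer': 90, 'FunctionalLayer': 40,
                 'DatabaseLayer': 30, 'ZopelessLayer': 20}
total = sum(layer_minutes.values())    # 180 minutes run serially
floor = max(layer_minutes.values())    # 90 minutes: the biggest layer
print(floor / float(total))            # 0.5 - nowhere near 15%-20%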

> - Francis had said earlier when talking with me about the project that running the tests on multiple machines might be an acceptable way to achieve the goal.  You specifically disallow that, even with the LEP title ("Single machine parallel testing of single branches"), even though doing this with multiple machines would match the letter of the law (the biggest stretch I see is that "[p]ermit[ting] developers to reliably run parallelised as well" would mean that developers would need to run ec2 to meet that requirement).  As with the previous question, is there a deeper "must" hidden in here somewhere?  Perhaps it is cost related?

Not really - if using multiple machines/full-blown VMs is the right
way forward, we can do that. I believe that in the single-developer
case the performance will be significantly worse than with LXC, due to
the lack of a shared page cache and increased disk IO. You also then
have the overhead of syncing code across to the VMs/machines...

> That's all I've got so far. :-)

Thanks, these are great questions, please keep them coming!

-Rob

