yellow team mailing list archive

Re: parallel testing LEP questions

On Wed, Nov 23, 2011 at 1:13 PM, Robert Collins
<robertc@xxxxxxxxxxxxxxxxx> wrote:
>> If you don't intend to recommend/prescribe LXC + testr, these next
>> two questions are pertinent.
>>
>> - You write that the solution "[m]ust parallelise more effectively
>> than bin/test -j (which does per-layer splits)."  Is that really a
>> "must"? If we met your success metric ("down to less than 50% of the
>> current time, preferably 15%-20%"), would it really matter which
>> method got there?  If it does matter, can you identify what the
>> underlying "must" is for rejecting the -j approach, so that, for
>> instance, other solutions can be cleanly rejected?
>
> Our test distribution per layer is not very even - I highly doubt that
> we'd be able to meet a reduction to 15% of the current time splitting
> per layer.

Let's look at the test distribution: The last buildbot run took 360
minutes.  There were 4 layers that took longer than 11 minutes to run:
55, 56, 65, and 99 minutes.  All the other layers add up to about
60 minutes.

If we bisect the four largest layers (to make it so the test runner's
blind layer scheduling can't bite us too hard) and assume that running 4
layers simultaneously imposes no more than a 50% overhead, then we would
be right at 40% of the current running time.

Reasoning sidebar: the longest layer takes 99 minutes.  Even after it is
bisected, each half is still longer than any other remaining layer, so
for pessimism's sake we assume the two halves run one after another.
All the other layers would be finished by then, so that gives us
99*1.50/360 = .41.
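
For anyone who wants to replay the sidebar arithmetic with different
numbers, here is a minimal sketch.  The function and constant names are
made up for illustration; the inputs are the figures quoted above, and
the 50% overhead is the same assumption, not a measurement.

```python
# Figures from the buildbot run discussed above (minutes).
TOTAL_MINUTES = 360
LONGEST_LAYER = 99
# Assumed worst-case slowdown from running 4 layers at once (1.50 = +50%).
OVERHEAD = 1.50


def pessimistic_fraction(longest, total, overhead):
    """Fraction of the original run time if the two halves of the
    longest layer end up running back to back and everything else
    finishes in the meantime."""
    return longest * overhead / total


fraction = pessimistic_fraction(LONGEST_LAYER, TOTAL_MINUTES, OVERHEAD)
print(f"{fraction:.2f}")  # 0.41
```

Changing OVERHEAD shows how sensitive the estimate is to that guess.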

Even if we assume no parallelization overhead, per-test scheduling (as
opposed to per-layer as above) and four-way parallelization, we'll still
be at 25% of the original time, so I'm interested in ideas as to how we
might achieve a reduction to 15% of the original time.
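
To put a number on that, here is a rough sketch of the best case under
the same idealized assumptions (perfectly even per-test scheduling and
zero overhead).  The helper names are hypothetical, and real runs will
do worse than this bound.

```python
import math


def ideal_fraction(workers):
    """Best-case fraction of the original run time with perfectly
    balanced work across `workers` processes and no overhead."""
    return 1 / workers


def workers_needed(target_fraction):
    """Smallest number of workers whose ideal fraction is at or below
    the target fraction."""
    return math.ceil(1 / target_fraction)


print(ideal_fraction(4))     # 0.25
print(workers_needed(0.15))  # 7
```

In other words, even in the ideal case a 15% target implies more than
four-way parallelism.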

> The other issue of shared global state that will bite us,
> will also be a significant issue with -j, unless a remoting facility
> is brought in (and at that point it seems to be reinventing
> subunit.... :P).

This is the real catch.  If the tests haven't been written to be
parallelizable (which LP's certainly have not), then global state
collisions accumulated over years of assuming non-parallel tests could
be hard to fix.  On the other hand, if fixing them turns out to be easy,
then using the test runner's built-in parallelization (-j) would be the
most bang for the buck.

-- 
Benji York
