launchpad-dev team mailing list archive

Re: velocity: parallel testing or simplified merge machinery first

 

Sorry for not replying for a bit, got distracted.

On Wed, Feb 9, 2011 at 2:04 AM, Gary Poster <gary.poster@xxxxxxxxxxxxx> wrote:
>>> 1. Branches get lost in ec2, especially when there's no message to
>>>   tell me or anyone else about it. I might not notice anything the
>>>   matter until the following day.
>>
>> SMM will indeed help with this, but it's extremely rare isn't it?
>> Certainly on an individual basis that would stall.
>
> Actually, I'm not entirely clear how SMM would help with this.  My picture of SMM includes people usually continuing to run the test suite locally.

I think SMM would help because people could optimistically throw it
directly to land: ec2 wouldn't be needed so much, if at all.

>>> 2. Branches get bounced out of pqm. Again, this is exacerbated when
>>>   there is no message to tell anyone about it. There's also sometimes
>>>   a need to work with a LOSA to figure out what the reason was.
>>
>> This is RT 43883 which I've just filed; we really need to get this
>> /fixed/ and stop having half-stabs at it. I've asked Francis to give
>> it pri 90 - zomg. It's really affecting developers a lot.
>
> To be clear, that now-fixed RT is about fixing the silence of the bounces: yay, and thank you!
>
> However, Gavin's #2 is still very pertinent: testfix mode bounces branches after a failed test run, by definition.  The SMM idea bounces the branch that failed tests, and any branches that were unfortunate enough to be run simultaneously, but subsequent branch landings are unaffected.  That's the heart of the change.  An intended side-effect is that it also drastically simplifies the collection of landing machinery we have.

Over the last month about 80% (NotAMetric) of our testfixes have been
due to flaky tests failing sporadically: we've not debugged them, and
so they keep happening. SMM won't help with this at all - it's
orthogonal. It will, should a sporadic failure *introduced in a merge*
trigger during that merge, keep it out. But most (if not all) of our
sporadic failures get by ec2 land. So I think SMM will help with 20%
of our testfixes - an important number, but it's not an absolute fix.
(I'm not saying it has been presented as one either).

> I haven't commented on this thread before, so I'll collect a few additional thoughts here.

Thanks.

> = Landing machinery vs. Parallel test suites =
>
> I think fixing our landing machinery is a better goal than parallel test suites.  The pain I experience, and that my team reports, is tied up with landing issues such as testfix mode.
>
> That said, SMM is one approach to that goal.  If "parallel test suites" were recast as "fix our landing machinery by introducing parallel test suites of < 1 hour and PQM as it was before, with one branch at a time" (as you proposed) then I'd be very interested. Importantly, success on that effort would not have been achieved until the landing machinery were improved, to eliminate testfix mode and show that landing branches takes less time on average than now.

I think that's a great goal, but in the interest of keeping things
decoupled I'm not going to put it in my parallel testing LEP. I *will*
put in a weaker goal, which is that buildbot runs its tests for devel
and db-devel using parallel testing.

That would put us in a good position to *either* move to SMM with one
branch at a time, or migrate PQM to a parallel test capable machine
and turn pqm running the test suite back on. *Either* of those would
satisfy the locking-out of the 20% of test failures that are not
sporadic.

> I think it would be worth analyzing the technical merits of the two approaches.  To agree with Julian's mail, the parallel test run story feels much riskier technically, but that's one person's (well, two people's ;-) ) observation of one aspect of the decision.  On the other side, solving the problem with parallel test suites  and single-branch PQM runs *should* reduce or eliminate the need for the separate ec2 test pre-runs, which would be a huge win.  The risk/reward balance might lean away from SMM, even with greater risk for parallel test runs.  Happily, that's not my call.
>
> To repeat and summarize, the *problem to be solved* IMO and in the opinion of most other people on this thread is to make our landing story better.

That's certainly *a* goal, and one that is important. But it's also a
subordinate goal to the overriding one: better velocity. Like you I
think there is a risk/reward balance here - and we can keep thinking
about it until 2 more teams have finished projects and blue is ready
to come out of maintenance mode: that's when the next TA work item is
scheduled.

In terms of velocity, failed landings are a big burden *because* they
are slow to recover from. If we /could/ eliminate all failed landings
with one project, then the time to recover would stop being a factor.
We can't though until we eliminate all sporadic test failures (which
we should do) - and guarantee that no more will occur (this is
probably impossible). I think we need a multi-pronged approach: make it
faster to recover from sporadic test failures (which will also make it
easier to fix them), and make it harder / impossible to introduce
completely failed tests (by getting premerge testing back on the
table). Which we do first should depend on which will bring us the
most reward, and *because* most of our failures are intermittent
failures rather than complete failures, I'm leaning heavily towards
reducing the time to recover from those failures as the thing that
will make the most difference to us today.

> = State of SMM =
>
> If we do go down the road of SMM, I have some technical thoughts about the current state of that effort.  I've shared them with Francis before, so they should come as no surprise to him, but I haven't spoken more publicly.  I'll summarize here.

...

Thanks for the status report - that's a very useful thing to know!

Cheers,
Rob


