launchpad-dev team mailing list archive

Re: velocity: parallel testing or simplified merge machinery first

 

On Feb 7, 2011, at 9:32 PM, Robert Collins wrote:

> On Sat, Feb 5, 2011 at 12:10 AM, Gavin Panella
> <gavin.panella@xxxxxxxxxxxxx> wrote:
>> On 4 February 2011 04:28, Robert Collins <robertc@xxxxxxxxxxxxxxxxx> wrote:
>>> I'm wondering if folk have a particularly strong opinion (and
>>> rationale :P) for which we should do first. They are *both* partly
>>> implemented, and *both* are likely to have long tails leading to
>>> niggly bits to sort out over some weeks.
>> 
>> My gut feeling is that velocity is hurt most when:
>> 
>> 1. Branches get lost in ec2, especially when there's no message to
>>   tell me or anyone else about it. I might not notice anything the
>>   matter until the following day.
> 
> SMM will indeed help with this, but its extremely rare isn't it?
> Certainly on an individual basis that would stall.

Actually, I'm not entirely clear how SMM would help with this.  My picture of SMM includes people usually continuing to run the test suite locally.

> 
>> 2. Branches get bounced out of pqm. Again, this is exacerbated when
>>   there is no message to tell anyone about it. There's also sometimes
>>   a need to work with a LOSA to figure out what the reason was.
> 
> This is RT 43883 which I've just filed; we really need to get this
> /fixed/ and stop having half-stabs at it. I've asked Francis to give
> it pri 90 - zomg. Its really affecting developers a lot.

To be clear, that now-fixed RT is about fixing the silence of the bounces: yay, and thank you!

However, Gavin's #2 is still very pertinent: testfix mode bounces branches after a failed test run, by definition.  The SMM idea bounces the branch that failed tests, and any branches that were unfortunate enough to be run simultaneously, but subsequent branch landings are unaffected.  That's the heart of the change.  An intended side-effect is that it also drastically simplifies the collection of landing machinery we have.
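
The difference can be sketched as a toy model. This is only an illustration of the paragraph above under simplifying assumptions, not how PQM, buildbot, or any SMM implementation actually works: the Branch class, its testfix flag, and both function names are hypothetical, and a "batch" just stands for the branches that were tested together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    name: str
    passes: bool           # would the test suite pass with this branch?
    testfix: bool = False  # hypothetical flag: branch is marked as a test fix


def land_with_testfix_mode(batches):
    """Toy model: after a failed run, only testfix branches get through."""
    landed, bounced = [], []
    in_testfix_mode = False
    for batch in batches:
        for branch in batch:
            if in_testfix_mode and not branch.testfix:
                # Everything that is not a fix is bounced while the mode lasts.
                bounced.append(branch.name)
            elif branch.passes:
                landed.append(branch.name)
                in_testfix_mode = False  # a passing run clears the mode
            else:
                bounced.append(branch.name)
                in_testfix_mode = True
    return landed, bounced


def land_with_smm(batches):
    """Toy model: a failed run bounces only the branches tested together."""
    landed, bounced = [], []
    for batch in batches:
        names = [branch.name for branch in batch]
        if all(branch.passes for branch in batch):
            landed.extend(names)
        else:
            bounced.extend(names)  # later batches are unaffected
    return landed, bounced


# Branch "b" fails; "c" was unlucky enough to be tested alongside it.
batches = [
    [Branch("a", True)],
    [Branch("b", False), Branch("c", True)],
    [Branch("d", True)],
    [Branch("e", True)],
]
print(land_with_testfix_mode(batches))
print(land_with_smm(batches))
```

In this model the later branches "d" and "e" are bounced under testfix mode (until someone lands a fix), but land normally under SMM.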

I haven't commented on this thread before, so I'll collect a few additional thoughts here.

= TDD =

To address Aaron's comments, TDD can be done with a small subset of the test suite, and making that loop faster would mean reducing the time it takes to run one test or a few tests--by reducing the time it takes to start up their layers, say.  The parallel test initiative would not help there, as far as I can tell.

= Landing machinery vs. Parallel test suites =

I think fixing our landing machinery is a better goal than parallel test suites.  The pain I experience, and that my team reports, is tied up with landing issues such as testfix mode.  

That said, SMM is one approach to that goal.  If "parallel test suites" were recast as "fix our landing machinery by introducing parallel test suites of < 1 hour and PQM as it was before, with one branch at a time" (as you proposed) then I'd be very interested. Importantly, that effort would not count as a success until the landing machinery itself had improved: testfix mode eliminated, and landing a branch shown to take less time on average than it does now.

I think it would be worth analyzing the technical merits of the two approaches.  To agree with Julian's mail, the parallel test run story feels much riskier technically, but that's one person's (well, two people's ;-) ) observation of one aspect of the decision.  On the other side, solving the problem with parallel test suites and single-branch PQM runs *should* reduce or eliminate the need for the separate ec2 test pre-runs, which would be a huge win.  The risk/reward balance might lean away from SMM, even with greater risk for parallel test runs.  Happily, that's not my call.

To repeat and summarize, the *problem to be solved* IMO and in the opinion of most other people on this thread is to make our landing story better.

= State of SMM =

If we do go down the road of SMM, I have some technical thoughts about the current state of that effort.  I've shared them with Francis before, so they should come as no surprise to him, but I haven't spoken more publicly.  I'll summarize here.

 - The Foundations effort was largely aiming for a proof of concept that Foundations could maintain and improve while it was running. That would not be the goal of a feature squad.  This means that the squad would have to expend more effort on it than Foundations would have initially.  It also would probably mean that the end result would be nicer.
 - Tarmac was not designed for what we needed and getting it ready for that functionality was more contentious and problematic than we expected.  These issues are still not resolved.
 - Francis and I agreed that other systems, like Hudson/Jenkins, *might* be elegantly extensible enough to handle what we need themselves.  In that case, the "one running piece of software" would be Hudson/Jenkins with our extensions, rather than tarmac or PQM.  The question would be, what baseline functionality do we want to build off of?  Tarmac ended up not bringing much to the table for this particular problem, other than Paul's energy and interest, which admittedly is very nice to have.  Hudson/Jenkins would at least bring visibility to the test runs, which would be very nice, and is missing from Tarmac AFAIK.

= Summary =

Parallel test runs are a means to an end.  If the end we strive for is significantly increasing the speed and reliability of our landing, that's a potentially compelling argument to me.  SMM approaches the right problem, IMO, but still has some work left to do, and might not be as nice as a parallel test run solution *for the landing problem*.  That said, parallel test runs have been problematic in past attempts.

Gary

