
Re: release branches

 

On Mon, Oct 18, 2010 at 11:53 PM, Tom Haddon <tom.haddon@xxxxxxxxxxxxx> wrote:
> On Mon, 2010-10-18 at 22:54 +1300, Robert Collins wrote:
>> On Mon, Oct 18, 2010 at 10:29 PM, Tom Haddon <tom.haddon@xxxxxxxxxxxxx> wrote:
>> >> And at that point, if we have a security issue we have to deploy asap;
>> >> we'd do the following:
>> >>  - cowboy it out there [and keep it as a cowboy on future deploys]
>> >
>> > So this means we'd be deploying a security fix without having run the
>> > test suite against it in a controlled environment (i.e. buildbot/PQM)?
>>
>> No, we can verify things in ec2.
>
> *can* being the operative word here. But even so, I don't think ec2 is
> the "canonical" copy of production environment, otherwise we wouldn't
> need buildbot at all, right?

That doesn't follow. If merges were serialised, ec2 would be fine and
buildbot not needed. But two developers landing at once can tread on
each other, so buildbot needs to exist to act as a serialisation
point.

Prior to lucid, buildbot and ec2 were identical (and we had fewer problems).

>> We already have a means for handling cowboys; we only gain
>> incrementally here if we can eliminate that entire process - can we?
>> I'd say we *can't* today because 'zomg fix it now' stuff does happen.
>
> Erm, we don't already have a means for handling cowboys. We currently
> have a hacked workaround that is very painful and potentially error
> prone for LOSAs. I'm trying to avoid that in the future, and I don't
> think "we do this currently" is a good justification.

Ok, so can we:
 - rule out zomg emergencies
 - automate it so it's neither painful nor error prone
 - while still fast.

E.g. (sketching): have a 'cowboy.patch' file that lives in the
deployment area, is persistent, and is reported on by the deployment
scripts (a diffstat will do; it won't show security details, but it
will tell us there is something there) - roughly like the sketch below.
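
A minimal Python sketch of that reporting step, just to make the idea
concrete - the patch location and script are assumptions of mine, not
the real deployment tooling:

# Hypothetical sketch only; the path and file name are made up, and
# 'diffstat' is the stock command-line tool.
import os
import subprocess

COWBOY_PATCH = "/srv/launchpad/cowboy.patch"  # assumed location

def report_cowboy():
    """Summarise any persistent cowboy patch during a deploy.

    A diffstat shows only file names and change counts, so a security
    fix is flagged without its content appearing in the deploy log.
    """
    if not os.path.exists(COWBOY_PATCH) or os.path.getsize(COWBOY_PATCH) == 0:
        print("No cowboy patch present.")
        return
    with open(COWBOY_PATCH) as patch:
        output = subprocess.check_output(["diffstat"], stdin=patch)
    print("WARNING: persistent cowboy patch in the deployment area:")
    print(output.decode())

if __name__ == "__main__":
    report_cowboy()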

>> I'd like to quantify how much more work it is. Say that there is one
>> security landing a month, and we're deploying individual revisions.
>> The extra work for handling security via a cowboy is then amortised
>> over 200 commits (to take the last month). If we save 5 minutes on the
>> inner loop for those 200 commits, and spend 2 hours dealing with that
>> security fix, we're still ahead 880 minutes.
>>
>> That's not to say that 2 hours would be tolerable or a goal for
>> security fixes, just that *overall* it's a win to take it out of the
>> common case completely.
>
> You're not really comparing like for like here. You're comparing 5
> minutes (or whatever it is) of extra time to deploy something to 2 hours
> (or whatever it is) of extra LOSA time for a cowboy (plus the danger of
> overwriting the cowboy through human error).

I'm comparing 5 minutes of *latency in the developer window before
they can hand off /any patch/* to 2 hours of LOSA time to do a cowboy,
plus the existing risk of overwriting the cowboy via human error. That's
1000 minutes a month spread over the team, or half an hour a month per
person where they have to choose between context switching and slowing
down other people, or not context switching and waiting for the system.
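
Spelling the arithmetic out (the team size below is my own assumption,
chosen only to make the per-person figure come out; it isn't stated
anywhere in this thread):

# Back-of-envelope amortisation of one security cowboy per month.
commits_per_month = 200       # roughly last month's landing rate
minutes_saved_per_commit = 5  # handoff latency removed from each landing
cowboy_cost = 2 * 60          # minutes of LOSA time for one cowboy

saved = commits_per_month * minutes_saved_per_commit  # 1000 minutes/month
net = saved - cowboy_cost                              # 880 minutes ahead

team_size = 33                    # assumed; not a figure from the thread
print(net, saved / team_size)     # ~880 net, ~30 minutes per person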

I am also amazed at the assertion that it takes 2 hours to cowboy;
I've seen one done, and it didn't take the LOSA doing it 2 hours.

>> > I'd propose a slight change to the above suggestion:
>> >
>> > - Keep production-devel/production-stable (now the buildbot instances
>> > run in the DC, there's no extra cost to doing so).
>>
>> There is a cost: we have to deal with test runs that fail; we have to
>> resource test runs on it. If we parallelise the test suite we're going
>> to be wanting serious grunt to run the test suite, and that CPU time
>> doesn't come free. We also need engineering and sysadmin time to
>> manage the instance, and we have to deal with it during upgrades
>> (e.g. the lucid one we just did).
>>
>> That said, I'm open to keeping it but not deploying from it except
>> when there is a security issue.
>
> This kind of defeats the point. If we're not deploying from the same
> branch all the time, there are extra manual (and error-prone) steps
> involved.

AIUI it's something like --branch=production-stable on the call to
deploy, isn't it? That's a pretty safe thing to do *in the exceptional
case*, which is (by my back-of-hand estimate) < 0.5% of the deploys we
*want to be doing*.
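
Purely for illustration (the real deploy script isn't shown anywhere in
this thread, so treat every name here as hypothetical), the shape I have
in mind is 'stable' as the default, with the flag reserved for the
exceptional security case:

# Hypothetical deploy wrapper; the actual Launchpad tooling may differ.
import argparse

parser = argparse.ArgumentParser(description="Deploy sketch.")
parser.add_argument(
    "--branch", default="stable",
    help="Branch to deploy; production-stable only for security fixes.")
args = parser.parse_args()

# The point: the exceptional case is one extra argument, not a separate
# process. Real deployment mechanics deliberately omitted.
print("deploying from", args.branch)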

>> > - Have an automated job that pulls frequently (or pushes immediately)
>> > from the "approved" stable revno to production-stable
>>
>> we can do that, but I'd do it on demand: if prepping a security fix,
>> request this, and build on that.
>>
>> > - Security fixes still go through production-devel -> production-stable
>> > and can then subsequently be landed on devel->stable after having been
>> > rolled out.
>>
>> Sure.
>>
>> > The advantage of this is that LOSAs can *always* deploy from the tip of
>> > production-stable.
>>
>> I don't see that being implied by the changes you suggest.
>
> Erm, I must not be explaining it properly then, because that's *exactly*
> the outcome of what I'm proposing. Can you let me know how that's not
> clear so I can try and explain it a little more?

We can have production-stable, with a cron job (or buildbot, or
whatever) keeping it synced most of the time, and still choose to
deploy from *stable* as the default; the sync could be as simple as
the sketch below.
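
A minimal sketch of such a sync job, assuming placeholder branch
locations (the real URLs aren't given in the thread) and plain bzr
pull/push; it presumes the mirror branch already exists:

# Cron-driven sync keeping production-stable in step with stable.
import subprocess

STABLE = "lp:launchpad/stable"                  # placeholder location
PROD_STABLE = "lp:launchpad/production-stable"  # placeholder location
MIRROR = "/srv/mirrors/production-stable"       # local copy of the branch

def sync():
    # Update the local mirror from stable, then push it out so the LOSAs
    # can always deploy from the tip of production-stable.
    subprocess.check_call(["bzr", "pull", STABLE, "-d", MIRROR])
    subprocess.check_call(["bzr", "push", PROD_STABLE, "-d", MIRROR])

if __name__ == "__main__":
    sync()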

>> > No approval is needed, and once we get to the stage
>> > of automating deployments that becomes a *lot* easier.
>>
>> We're aiming for automated-doing-the-deployment not
>> automated-triggering-of-deployments.
>
> Ok, so now I definitely am misunderstanding things. Are you saying we
> don't want to automatically roll things out (as a longer term goal)? If
> not, I don't think the extra 5 minutes of using the production-stable
> branch (which means we consistently deploy from the same branch) will
> make any difference.

The *highest* period of risk for us, given that we're a mature system
with fairly well-known user load etc., is the period between doing a
deployment and the new code in that deployment being exercised. It's
risky to deploy without a human in the loop, ready to roll back if it
goes pear-shaped. I don't see that risk gaining us anything, which is
why I am not interested in us taking it.

I want deployments to be trivial things, easy to do, and done a lot,
but never triggered in an automatic fashion. Francis wants this too,
but AIUI he would like to see automated triggers for deployment... so
we disagree on this - but we agree on everything up to that point. (I
know he'll come around :))

>> The former adds reliability and speed to doing them; the latter adds
>> risk in the event that people are busy.
>>
>> We already, per the new process, have trivial-approval deployments
>> (though our toolchain needs to catch up, and we can't actually
>> *action* it till we have qastaging live with edge deployments turned
>> off).
>>
>> So, for clarity, how does the following strike you as an interim
>> position (with a review after 6 months):
>>  - keep prod-stable/devel
>>  - on-request deployments from stable except when doing a security fix
>>  - cron job to push from stable to prod-devel/prod-stable
>
> To be clear, I'm proposing a cron job to push from stable to
> production-devel and production-stable so the test suite doesn't have to
> run for production-devel -> production-stable, unless we're doing a
> security fix. Not sure if that was clear.

Thanks, yes, it was clear - but it's still work and maintenance that
has to justify its existence. Perhaps it does; I'm happy with the
compromise I'm suggesting, because it avoids both cowboys (which are
unpleasant) and having to care about prod-devel: 33% fewer branches to
maintain at production quality.

Long term we probably want to have our cake and eat it too - e.g.
protection against non-stable deployments being reverted - but again,
it should be extraordinarily rare that we're in that mode at all.

-Rob


