launchpad-dev team mailing list archive

Re: release branches

To: Tom Haddon <tom.haddon@xxxxxxxxxxxxx>
From: Robert Collins <robert.collins@xxxxxxxxxxxxx>
Date: Mon, 18 Oct 2010 22:54:26 +1300
Cc: losas <losas@xxxxxxxxxxxxx>, Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <1287394181.25585.43.camel@hurlyburly>
Sender: robertc@xxxxxxxxxxxxxxxxx

On Mon, Oct 18, 2010 at 10:29 PM, Tom Haddon <tom.haddon@xxxxxxxxxxxxx> wrote:
>> And at that point, if we have a security issue we have to deploy asap;
>> we'd do the following:
>>  - cowboy it out there [and keep it as a cowboy on future deploys]
>
> So this means we'd be deploying a security fix without having run the
> test suite against it in a controlled environment (i.e. buildbot/PQM)?

No, we can verify things in ec2.

>>  - land a regular branch fixing it for good
>>  - remove the cowboy when the regular branch has been incorporated
>> into the main deployed codebase.
>>
>> This would chop 4 hours off the time that things take to deploy,
>> remove one buildbot queue and generally make the whole code->live
>> story a bit simpler, at the cost of making the security-fix story more
>> complex. Personally, I think that that is a net win.
>
> From the LOSA perspective, it's also a lot more work. It basically
> requires manually applying a cowboy, keeping track of where that cowboy
> is applied, disabling any auto-rollouts to that server until the cowboy
> lands, and/or checking there are no cowboys applied on any servers
> before doing any rollouts.

We already have a means for handling cowboys; we only gain
incrementally here if we can eliminate that entire process - can we?
I'd say we *can't* today because 'zomg fix it now' stuff does happen.

I'd like to quantify how much more work it is. Say that there is one
security landing a month, and we're deploying individual revisions.
The extra work for handling security via a cowboy is then amortised
over 200 commits (to take the last month). If we save 5 minutes on the
inner loop for those 200 commits, and spend 2 hours dealing with that
security fix, we're still ahead 880 minutes (1000 saved, less 120 spent).

That's not to say that 2 hours would be tolerable or a goal for
security fixes, just that *overall* it's a win to take it out of the
common case completely.
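
To make the amortisation argument easy to re-run with other numbers, here
is a minimal sketch of the arithmetic. The figures are the ones quoted
above; the constant and function names are illustrative only.

```python
# Figures from the paragraph above; names are illustrative, not from any tool.
COMMITS_PER_MONTH = 200          # commits landed in the last month
MINUTES_SAVED_PER_COMMIT = 5     # inner-loop saving per commit
SECURITY_FIXES_PER_MONTH = 1     # assumed rate of security landings
MINUTES_PER_SECURITY_FIX = 2 * 60  # extra cowboy handling per fix


def net_minutes_saved(commits, saved_per_commit, fixes, cost_per_fix):
    """Minutes gained per month after paying for cowboy-based security fixes."""
    return commits * saved_per_commit - fixes * cost_per_fix


def break_even_cost_per_fix(commits, saved_per_commit, fixes):
    """Largest per-fix cost (in minutes) at which the change still breaks even."""
    return commits * saved_per_commit / fixes


net = net_minutes_saved(
    COMMITS_PER_MONTH,
    MINUTES_SAVED_PER_COMMIT,
    SECURITY_FIXES_PER_MONTH,
    MINUTES_PER_SECURITY_FIX,
)
print(net)  # 880
```

With these inputs the per-fix cost would have to exceed 1000 minutes
before the common-case saving is wiped out.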

> I'd propose a slight change to the above suggestion:
>
> - Keep production-devel/production-stable (now the buildbot instances
> run in the DC, there's no extra cost to doing so).

There is a cost: we have to deal with test runs that fail; we have to
resource test runs on it. If we parallelise the test suite we're going
to be wanting serious grunt to run the test suite, and that CPU time
doesn't come free. We also need engineering and sysadmin time to
manage the instance and have to handle dealing with it during upgrades
(e.g. the lucid one we just did).

That said, I'm open to keeping it but not deploying from it except
when there is a security issue.

> - Have an automated job that pulls frequently (or pushes immediately)
> from the "approved" stable revno to production-stable

We can do that, but I'd do it on demand: if prepping a security fix,
request this, and build on that.

> - Security fixes still go through production-devel -> production stable
> and can then subsequently be landed on devel->stable after having been
> rolled out.

Sure.

> The advantage of this is that LOSAs can *always* deploy from the tip of
> production-stable.

I don't see that being implied by the changes you suggest.

> No approval is needed, and once we get to the stage
> of automating deployments that becomes a *lot* easier.

We're aiming for automated-doing-the-deployment not
automated-triggering-of-deployments.

The former adds reliability and speed to doing them, the latter adds
risk in the event that people are busy.

We already, per the new process, have trivial-approval deployments
(though our toolchain needs to catch up, and we can't actually
*action* it till we have qastaging live with edge deployments turned
off).

So, for clarity, how does the following strike you as an interim
position (with a review after 6 months):
 - keep prod-stable/devel
 - on-request deployments from stable except when doing a security fix
 - cron job to push from stable to prod-devel/prod-stable
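
A rough sketch of what the cron step might look like, assuming the
approved stable revno is handed to the script and that a plain `bzr push`
with `-d` (source branch) and `-r` (revision) is enough. The paths, the
revno and the function names are hypothetical; nothing here reflects the
actual LOSA tooling.

```python
import subprocess

# Hypothetical locations: the real branch paths are not given in this thread.
STABLE_BRANCH = "/srv/example/stable"
PRODUCTION_TARGETS = (
    "/srv/example/production-devel",
    "/srv/example/production-stable",
)


def build_push_commands(approved_revno, source=STABLE_BRANCH,
                        targets=PRODUCTION_TARGETS):
    """Return one `bzr push` argv list per production branch."""
    return [
        ["bzr", "push", "-d", source, "-r", str(approved_revno), target]
        for target in targets
    ]


def push_approved(approved_revno, dry_run=True):
    """Print the commands (dry run) or run them, stopping on the first failure."""
    for cmd in build_push_commands(approved_revno):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 12345 is a placeholder revno, not a real approved revision.
    push_approved(12345)
```

Defaulting to a dry run keeps the sketch safe to try; a real job would
also need to decide where the "approved" revno comes from.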

-Rob
