
launchpad-dev team mailing list archive

Re: release branches

 

On Mon, 2010-10-18 at 22:54 +1300, Robert Collins wrote:
> On Mon, Oct 18, 2010 at 10:29 PM, Tom Haddon <tom.haddon@xxxxxxxxxxxxx> wrote:
> >> And at that point, if we have a security issue we have to deploy asap;
> >> we'd do the following:
> >>  - cowboy it out there [and keep it as a cowboy on future deploys]
> >
> > So this means we'd be deploying a security fix without having run the
> > test suite against it in a controlled environment (i.e. buildbot/PQM)?
> 
> No, we can verify things in ec2.

*can* being the operative word here. But even so, I don't think ec2 is
the "canonical" copy of the production environment; otherwise we wouldn't
need buildbot at all, right?

> >>  - land a regular branch fixing it for good
> >>  - remove the cowboy when the regular branch has been incorporated
> >> into the main deployed codebase.
> >>
> >> This would chop 4 hours off the time that things take to deploy,
> >> remove one buildbot queue and generally make the whole code->live
> >> story a bit simpler, at the cost of making the security-fix story more
> >> complex. Personally, I think that that is a net win.
> >
> > From the LOSA perspective, it's also a lot more work. It basically
> > requires manually applying a cowboy, keeping track of where that cowboy
> > is applied, disabling any auto-rollouts to that server until the cowboy
> > lands, and/or checking there are no cowboys applied on any servers
> > before doing any rollouts.
> 
> We already have a means for handling cowboys; we only gain
> incrementally here if we can eliminate that entire process - can we?
> I'd say we *can't* today because 'zomg fix it now' stuff does happen.

Erm, we don't already have a means for handling cowboys. We currently
have a hacked workaround that is very painful and potentially
error-prone for LOSAs. I'm trying to avoid that in the future, and I
don't think "we do this currently" is a good justification.
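
To illustrate the bookkeeping described above (tracking where a cowboy is applied and checking that none are present before a rollout), here is a minimal Python sketch. It is not how LOSAs do this today; the `CowboyRegistry` class, the server names and the cowboy label are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CowboyRegistry:
    """Records which servers carry a manually applied patch (a "cowboy").

    Hypothetical helper for illustration only.
    """

    # Maps server name -> set of cowboy labels applied on that server.
    applied: dict = field(default_factory=dict)

    def apply(self, server, cowboy):
        """Note that `cowboy` has been applied on `server`."""
        self.applied.setdefault(server, set()).add(cowboy)

    def remove(self, server, cowboy):
        """Note that `cowboy` is gone from `server` (e.g. the real fix landed)."""
        cowboys = self.applied.get(server, set())
        cowboys.discard(cowboy)
        if not cowboys:
            self.applied.pop(server, None)

    def blocked_servers(self, servers):
        """Return the servers in `servers` that still carry a cowboy."""
        return [s for s in servers if self.applied.get(s)]


def safe_to_roll_out(registry, servers):
    """A rollout is only safe if no target server would lose a cowboy."""
    return not registry.blocked_servers(servers)


registry = CowboyRegistry()
registry.apply("app1", "security-fix")            # hypothetical names
print(safe_to_roll_out(registry, ["app1", "app2"]))
registry.remove("app1", "security-fix")
print(safe_to_roll_out(registry, ["app1", "app2"]))
```

Even in this toy form, every rollout has to consult the registry first, and the registry is only as good as the humans keeping it up to date.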

> I'd like to quantify how much more work it is. Say that there is one
> security landing a month, and we're deploying individual revisions.
> The extra work for handling security via a cowboy is then amortised
> over 200 commits (to take the last month). If we save 5 minutes on the
> inner loop for those 200 commits, and spend 2 hours dealing with that
> security fix, we're still ahead 880 minutes.
> 
> That's not to say that 2 hours would be tolerable or a goal for
> security fixes, just that *overall* it's a win to take it out of the
> common case completely.

You're not really comparing like for like here. You're comparing 5
minutes (or whatever it is) of extra time to deploy something to 2 hours
(or whatever it is) of extra LOSA time for a cowboy (plus the danger of
overwriting the cowboy through human error).
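
To make the comparison concrete, here is a small Python sketch that keeps the two quantities apart instead of netting them. It uses only the figures quoted in this thread (200 commits, 5 minutes per commit, one cowboy, 2 hours of LOSA time); the function and key names are made up for illustration, and the risk of overwriting a cowboy is not modelled at all.

```python
def compare(commits, minutes_saved_per_commit, cowboys, losa_minutes_per_cowboy):
    """Return pipeline time saved and LOSA time spent, in minutes.

    The two figures are reported separately because they measure different
    things: elapsed deployment time versus hands-on sysadmin effort.
    """
    pipeline_saved = commits * minutes_saved_per_commit
    losa_spent = cowboys * losa_minutes_per_cowboy
    return {
        "pipeline_minutes_saved": pipeline_saved,
        "losa_minutes_spent": losa_spent,
        # Only meaningful if the two kinds of minutes are interchangeable.
        "naive_net_minutes": pipeline_saved - losa_spent,
    }


result = compare(
    commits=200,
    minutes_saved_per_commit=5,
    cowboys=1,
    losa_minutes_per_cowboy=120,
)
print(result)
```

The netted figure only means something if a minute of pipeline latency and a minute of LOSA effort are worth the same, which is exactly the assumption I'm questioning.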

> > I'd propose a slight change to the above suggestion:
> >
> > - Keep production-devel/production-stable (now the buildbot instances
> > run in the DC, there's no extra cost to doing so).
> 
> There is a cost: we have to deal with test runs that fail; we have to
> resource test runs on it. If we parallelise the test suite we're going
> to be wanting serious grunt to run the test suite, and that CPU time
> doesn't come free. We also need engineering and sysadmin time to
> manage the instance and have to handle dealing with it during upgrades
> (e.g. the lucid one we just did).
> 
> That said, I'm open to keeping it but not deploying from it except
> when there is a security issue.

This kind of defeats the point. If we're not deploying from the same
branch all the time, there are extra manual (and error-prone) steps
involved.

> > - Have an automated job that pulls frequently (or pushes immediately)
> > from the "approved" stable revno to production-stable
> 
> We can do that, but I'd do it on demand: if prepping a security fix,
> request this, and build on that.
> 
> > - Security fixes still go through production-devel -> production-stable
> > and can then subsequently be landed on devel->stable after having been
> > rolled out.
> 
> Sure.
> 
> > The advantage of this is that LOSAs can *always* deploy from the tip of
> > production-stable.
> 
> I don't see that being implied by the changes you suggest.

Erm, I must not be explaining it properly then, because that's *exactly*
the outcome of what I'm proposing. Can you let me know how that's not
clear so I can try and explain it a little more?

> > No approval is needed, and once we get to the stage
> > of automating deployments that becomes a *lot* easier.
> 
> We're aiming for automated-doing-the-deployment not
> automated-triggering-of-deployments.

Ok, so now I definitely am misunderstanding things. Are you saying we
don't want to automatically roll things out (as a longer-term goal)? If
that's not what you're saying, I don't think the extra 5 minutes of using
the production-stable branch (which means we consistently deploy from the
same branch) will make any difference.

> The former adds reliability and speed to doing them, the latter adds
> risk in the event that people are busy.
> 
> We already, per the new process, have trivial-approval deployments
> (though our toolchain needs to catch up, and we can't actually
> *action* it till we have qastaging live with edge deployments turned
> off).
> 
> So, for clarity, how does the following strike you as an interim
> position (with a review after 6 months):
>  - keep prod-stable/devel
>  - on-request deployments from stable except when doing a security fix
>  - cron job to push from stable to prod-devel/prod-stable

To be clear, I'm proposing a cron job to push from stable to
production-devel and production-stable so the test suite doesn't have to
run for production-devel -> production-stable, unless we're doing a
security fix. Not sure if that was clear.
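
As a rough sketch of what such a cron job might run, the Python below builds one `bzr push` command per target branch. The branch locations and the `APPROVED_REVNO` value are placeholders (the real ones are not given in this thread), and the script defaults to a dry run that only prints the commands.

```python
import subprocess

# Placeholder locations, assumed for illustration only.
STABLE = "/srv/branches/stable"
TARGETS = [
    "/srv/branches/production-devel",
    "/srv/branches/production-stable",
]
APPROVED_REVNO = None  # e.g. an approved stable revno; None pushes the tip


def build_push_commands(source, targets, revno=None):
    """Return one `bzr push` argument list per target branch."""
    commands = []
    for target in targets:
        cmd = ["bzr", "push", "-d", source]
        if revno is not None:
            # Push only up to the approved revision rather than the tip.
            cmd += ["-r", str(revno)]
        cmd.append(target)
        commands.append(cmd)
    return commands


def push_all(source, targets, revno=None, dry_run=True):
    """Print the push commands, or run them when dry_run is False."""
    for cmd in build_push_commands(source, targets, revno):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failed push so it gets noticed.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    push_all(STABLE, TARGETS, revno=APPROVED_REVNO, dry_run=True)
```

A security fix would not go through this job: it would land on production-devel and reach production-stable via the test suite as it does now.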

Tom

> -Rob




