launchpad-dev team mailing list archive
Message #03791
Re: rollout changes to reduce downtime - linked to 'release features when they are ready'
On Tue, Jul 20, 2010 at 1:20 AM, Michael Hudson
<michael.hudson@xxxxxxxxxxxxx> wrote:
> On 20/07/10 08:08, Robert Collins wrote:
>>
>> One of the ramifications of *all* of the proposed 'release features
>> when they are ready' workflows is more production rollouts. As such I
>> went over the proposed plan with James Troup looking for holes - we
>> can't increase downtime - we spend too much time down already :)
>
> As a general question, do you intend to change the 'all machines are running
> the same version of the code' idea, which is sort of vaguely nearly true at
> the moment? I guess downtime-free upgrades basically imply it not being
> true, so we should be starting to design towards that...
It's not entirely honoured at the moment - we stagger appserver
upgrades, and also edge runs different code to production (which makes
it pretty hard to determine if 'error X' is due to the user
population, or the code base). So the overall goal is:
- 1 deployed codebase
- rev the deployed version when we've QA'd some more changes
The actual deploy-a-version process needs to continue to stagger
things, but it may - for some services - have to stagger on a period
of hours rather than minutes.
I would say that a single rollout is not *complete* until we're
running a single rev across the board - which means getting to the
point of being able to do graceful upgrades of the importds, buildds,
codehosting, jobs system - all the things that do multi-minute
operations.
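The staggered-but-converging rollout described above can be sketched in a few lines. This is a hypothetical helper, not real Launchpad tooling: `upgrade_one` and `is_healthy` stand in for whatever the deployment scripts actually do, and the bail-out-on-first-failure behaviour is an assumption about how you'd want a rollout to stop.

```python
import time

def staggered_upgrade(hosts, upgrade_one, is_healthy, stagger_seconds=0):
    """Upgrade hosts one at a time; stop at the first unhealthy host.

    `upgrade_one` and `is_healthy` are callables supplied by the
    deployment tooling (hypothetical names).  Returns the list of
    hosts successfully upgraded, so a dashboard can show how far the
    rollout got before completing or halting.
    """
    done = []
    for host in hosts:
        upgrade_one(host)          # e.g. push the new tree, restart service
        if not is_healthy(host):   # stop before touching more hosts
            break
        done.append(host)
        if stagger_seconds:        # hours for slow services, minutes for fast
            time.sleep(stagger_seconds)
    return done
```

The rollout is only *complete* in the sense used above when `done` covers every host, i.e. a single rev is running across the board.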
>> As a result I've filed a few RT's to get redundant instances (probably
>> on the same servers) of things like the xmlrpc server, codehosting
>> server etc, so we can use one instance live and upgrade the other in
>> parallel.
>
> Yay. Out of curiosity, what's the plan for doing the switcheroo for
> codehosting? Using a tcp-level load balancer/port forwarder sounds easiest
> to me (and most useful to allow splitting off the bzr+ssh processes from the
> code host, something we should be thinking about in the medium term).
Ask James :) Something like that is my understanding, though the
bzr+ssh split off seems unrelated?
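The switcheroo itself is simple in principle: keep two instances behind the forwarder, point new connections at one, upgrade the other, then flip. A toy model of the backend selection (real deployments would use an actual TCP load balancer; the blue/green naming and addresses here are illustrative):

```python
class Switcheroo:
    """Track which of two backend instances is live, so the standby
    can be upgraded in parallel and then swapped in.  A toy model of
    a TCP-level load balancer's backend selection, not a real one.
    """

    def __init__(self, blue, green):
        self.backends = {"blue": blue, "green": green}
        self.live = "blue"

    def live_backend(self):
        """Address new connections should be forwarded to."""
        return self.backends[self.live]

    def standby(self):
        """The instance that is safe to upgrade right now."""
        return "green" if self.live == "blue" else "blue"

    def swap(self):
        """After the standby is upgraded and healthy, make it live."""
        self.live = self.standby()
        return self.live_backend()
```

Existing connections to the old instance can be left to drain naturally while new ones land on the upgraded instance.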
> In some sense, the importd system doesn't seem super high priority as the
> users don't interact with it directly. Currently, import jobs tend to die
> during rollouts, which is certainly a bit inefficient but doesn't really
> have other consequences as the system is built to cope with jobs/import
> machines flaking out.
Ok, so the answer may be 'we interrupt those jobs when we're ready'?
> The importds only interact with the rest of launchpad via the internal
> xml-rpc server and librarian, so load-balancing those services to get
> downtime-free upgrades[1] would mean that upgrades to the rest of Launchpad
> could be done without impacting the code import system. Semi-obviously,
> the database being in read-only mode will tend to bring the system to
> a halt.
Right. Read only mode *is downtime*, and while it's a necessary
facility from time to time, we should only invoke it when we need to.
> When it comes to updating code on the import machines themselves, I don't
> think the issue is very different to the issues you'll have with cronscripts
> on other machines. It might be a bit harder because code import jobs can
> run for an hour or so[2], so the fairly clear approach of:
> for each machine:
> stop machine accepting new jobs
> wait for currently running jobs to finish
> upgrade code on machine
> allow machine to accept new jobs
>
> would probably take a prohibitive amount of time for code imports.
Hours we could do if we had to, I think - automation and a dashboard
FTW. Days we can't, and I'd rather not do hours as a general case
anyhow.
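The drain-then-upgrade loop quoted above translates almost directly into code. A minimal sketch, assuming machines are simple records and `running_jobs`, `upgrade`, and `wait` are hooks into whatever the job system actually exposes (all hypothetical names):

```python
def drain_and_upgrade(machines, running_jobs, upgrade, wait=lambda: None):
    """One machine at a time: stop new jobs, wait out running ones,
    upgrade, re-enable.  This is the slow path - with hour-long code
    import jobs, the inner wait is what makes the whole pass take hours.
    """
    for m in machines:
        m["accepting"] = False        # stop machine accepting new jobs
        while running_jobs(m) > 0:    # wait for currently running jobs
            wait()                    # in real life: sleep, update dashboard
        upgrade(m)                    # upgrade code on machine
        m["accepting"] = True         # allow machine to accept new jobs
```

Draining several machines concurrently (rather than strictly one at a time) would bound the total wall-clock time at roughly one job length instead of one per machine, which is what makes 'hours not days' plausible.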
> An approach where you installed the new code at a new path and didn't delete
> the old code until all jobs running from that tree finished would work fine.
> I don't know how you tell all jobs running from a particular tree are
> finished though.
Can we change the code to make that clear somehow?
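One way to make it clear: have each job register the tree it started from and deregister on exit, so 'no jobs left on this tree' becomes an explicit query rather than a guess. A hypothetical sketch (Launchpad has no `TreeTracker`; the paths are made up):

```python
class TreeTracker:
    """Count jobs running from each deployed code tree, so an old
    tree can be deleted only once its last job has finished.
    """

    def __init__(self):
        self.jobs = {}  # tree path -> number of running jobs

    def start_job(self, tree):
        """Called by a job as it starts, naming the tree it runs from."""
        self.jobs[tree] = self.jobs.get(tree, 0) + 1

    def finish_job(self, tree):
        """Called by a job (or its monitor) when it exits."""
        self.jobs[tree] -= 1

    def removable(self, tree):
        """True once no job still runs from `tree`."""
        return self.jobs.get(tree, 0) == 0
```

The deploy script would install the new tree at a fresh path, flip the 'current' pointer for new jobs, and poll `removable()` before deleting the old path.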
> Just upgrading the code under running jobs is probably low-risk but the idea
> does make me a bit uneasy.
Meep. No thanks ;)
> Changes to the protocol by which code import machines talk to the rest of
> launchpad would require a three step rollout (1: roll out addition of new
> protocol, 2: roll out code to use new protocol, 3: roll out removal of old
> protocol), but I think that's just life and something we can cope with.
Yes, that's exactly the point of this - to enable that sort of staged
process without waiting many weeks to go through each step - we should
be able to do such a transition in one day, so that we don't have
kludgy transitional code hanging around for extended periods.
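The three-step shape is easiest to see on the server side: step 1 registers the new method alongside the old, step 2 moves the callers over, step 3 drops the old method. A sketch with an invented dispatcher and invented method names, standing in for the real XML-RPC server:

```python
class StagedAPI:
    """A tiny method dispatcher used to illustrate a staged protocol
    transition; not Launchpad's XML-RPC machinery.
    """

    def __init__(self):
        self.methods = {}

    def add(self, name, fn):
        """Step 1: roll out the new protocol beside the old one."""
        self.methods[name] = fn

    def remove(self, name):
        """Step 3: remove the old protocol once no caller uses it."""
        del self.methods[name]

    def call(self, name, *args):
        return self.methods[name](*args)
```

Between steps 1 and 3, both methods exist, so old and new importd code can run side by side during the staggered deploy - which is precisely why the transitional window can be a day rather than weeks.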
> I think the issues with the jobs system in general are similar, although for
> every other job type there's just one machine that runs that type of job,
> and the other job implementations talk to the database directly.
>
> For the buildd-manager, I think it's actually fairly easy -- the manager
> itself is basically stateless, so assuming it has a way to exit cleanly
> after a scan, I think:
>
> install code for new buildd-manager
> shut down old manager
> start manager from new code
> remove old code
>
> will be a low impact event. You should check this with Julian though :-)
My understanding from James Troup is that the slaves go boom when the
TCP socket closes - I've filed a bug about this though.
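The 'exit cleanly after a scan' idea amounts to checking a shutdown flag only between scans, never mid-scan. A minimal sketch under that assumption - `scan` and `should_exit` are hypothetical hooks, not the real buildd-manager's API:

```python
def manager_loop(scan, should_exit):
    """Run scans back to back, checking the shutdown request only
    between scans so a scan is never interrupted mid-flight.
    Returns the number of scans completed, for the caller's logs.
    """
    scans = 0
    while True:
        scan()           # one full pass over the build farm
        scans += 1
        if should_exit():  # e.g. a flag set by SIGTERM or a control file
            return scans
```

With this shape, the new-code manager can be started the moment the old one returns, keeping the gap between scans small - though per the slave-breakage above, the socket-close behaviour needs fixing before even that is safe.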
Thanks for the feedback, it's excellent to know a bit more about how
things are actively deployed. It sounds like there might be a code
change needed to make code transitions on the importds easier to
manage - perhaps you could file that?
Thanks,
Rob