
launchpad-dev team mailing list archive

Re: rollout changes to reduce downtime - linked to 'release features when they are ready'

 

On 20/07/10 08:08, Robert Collins wrote:
> One of the ramifications of *all* of the proposed 'release features
> when they are ready' workflows is more production rollouts. As such I
> went over the proposed plan with James Troup looking for holes - we
> can't increase downtime - we spend too much time down already :)

As a general question, do you intend to change the 'all machines are running the same version of the code' idea, which is sort of vaguely nearly true at the moment? I guess downtime-free upgrades basically imply it not being true, so we should be starting to design towards that...

> As a result I've filed a few RT's to get redundant instances (probably
> on the same servers) of things like the xmlrpc server, codehosting
> server etc, so we can use one instance live and upgrade the other in
> parallel.

Yay. Out of curiosity, what's the plan for doing the switcheroo for codehosting? Using a TCP-level load balancer/port forwarder sounds easiest to me (and most useful, since it would also let us split the bzr+ssh processes off from the code host, something we should be thinking about in the medium term).
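
To make the TCP-level idea concrete, here's a purely illustrative forwarder sketch (the port numbers are made up, and in reality you'd reach for haproxy or similar rather than hand-rolling one). The point is only that the forwarder never looks inside the SSH stream, so which backend instance is currently live is invisible to clients:

  import socket
  import threading

  BACKEND = ("localhost", 5023)   # hypothetical "live" codehosting instance

  def pipe(src, dst):
      # Shovel bytes one way until EOF; no protocol awareness at all.
      while True:
          data = src.recv(65536)
          if not data:
              break
          dst.sendall(data)
      try:
          dst.shutdown(socket.SHUT_WR)   # tell the far side we're done writing
      except OSError:
          pass

  def handle(client):
      backend = socket.create_connection(BACKEND)
      threading.Thread(target=pipe, args=(client, backend)).start()
      threading.Thread(target=pipe, args=(backend, client)).start()

  listener = socket.socket()
  listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  listener.bind(("0.0.0.0", 5022))
  listener.listen(5)
  while True:
      # Repoint BACKEND (or the balancer's config) to move new connections to
      # the upgraded instance; existing connections drain off the old one.
      client, _ = listener.accept()
      handle(client)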

> There are three key things that are not yet prepped for highly
> available rollouts:
>   - cronscripts (probably including the job system)
>   - buildd master/slaves
>   - importds

> I've filed a bug for the cronscripts as a whole and for the buildd's -
> I had the temerity to mark these as high since we're going to be
> impacting the ability for us to increase our velocity safely until
> those are fixed.

> I don't know enough about the job system or the importd system to
> sensibly talk about highly available upgrades there yet. I'd love it
> if someone were to just file bugs / RT's as appropriate to get such a
> process in place - but failing that, I hope to discuss them with
> whomever knows most in the next day or two.

In some sense, the importd system doesn't seem super high priority as the users don't interact with it directly. Currently, import jobs tend to die during rollouts, which is certainly a bit inefficient but doesn't really have other consequences as the system is built to cope with jobs/import machines flaking out.

The importds only interact with the rest of Launchpad via the internal XML-RPC server and the librarian, so load-balancing those services to get downtime-free upgrades[1] would mean that upgrades to the rest of Launchpad could be done without impacting the code import system. Semi-obviously, the database being in read-only mode will tend to bring the system to a halt.
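
Just to spell that out (the URL and method name below are invented, not the real internal API): from the importd side, the rest of Launchpad is one XML-RPC endpoint plus the librarian, so as long as a single balanced address keeps answering, it doesn't matter which instance behind it is being upgraded at the time:

  import xmlrpc.client

  # Hypothetical balanced address and method name -- not the real internal
  # API.  The importd holds no state about which instance answers, so taking
  # one XML-RPC server down to upgrade it is invisible here.
  proxy = xmlrpc.client.ServerProxy("http://xmlrpc.lp.internal:8097/")
  job_id = proxy.get_job_for_machine("importd03")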

When it comes to updating code on the import machines themselves, I don't think the issue is very different from the issues you'll have with cronscripts on other machines. It might be a bit harder, though, because code import jobs can run for an hour or so[2], which means the fairly clear approach of:

 for each machine:
   stop machine accepting new jobs
   wait for currently running jobs to finish
   upgrade code on machine
   allow machine to accept new jobs

would probably take a prohibitive amount of time for code imports.
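
Written out (with entirely made-up helper methods -- nothing here exists in the tree), the loop above looks something like this, and it's the inner wait that makes it prohibitive:

  import time

  def drain_and_upgrade(machines):
      # Made-up helpers; this is just the loop above written out.
      for machine in machines:
          machine.stop_accepting_jobs()     # take it out of the rotation
          while machine.running_jobs():     # this is the painful part: a job
              time.sleep(60)                # can hold the machine for an hour+
          machine.upgrade_code()            # e.g. pull/rsync the new tree
          machine.start_accepting_jobs()    # back into the rotation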

An approach where you installed the new code at a new path and didn't delete the old code until all jobs running from that tree had finished would work fine. I don't know how you tell when all jobs running from a particular tree have finished, though.
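
One way you might tell, sketched with psutil and under the assumption that a job's worker process keeps its working directory inside the tree it was started from (I'm not claiming that's how the importds actually behave): declare the old tree quiescent once no live process has its cwd under it.

  import psutil

  def tree_is_quiescent(old_tree_path):
      # ad_value=None just skips processes we aren't allowed to inspect.
      # Only a heuristic: a worker that chdir()s elsewhere would be missed.
      for proc in psutil.process_iter(attrs=["cwd"], ad_value=None):
          cwd = proc.info["cwd"]
          if cwd and cwd.startswith(old_tree_path):
              return False
      return True

  # e.g. only delete the old tree once tree_is_quiescent("/srv/importd/r9876")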

Just upgrading the code under running jobs is probably low-risk but the idea does make me a bit uneasy.

Changes to the protocol by which the code import machines talk to the rest of Launchpad would require a three-step rollout (1: roll out the addition of the new protocol; 2: roll out code that uses the new protocol; 3: roll out the removal of the old protocol), but I think that's just life and something we can cope with.
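
As a sketch of what step 1 means on the server side (the method names are invented, not the actual scheduler API): the new call is registered alongside the old one, both are answered through steps 1 and 2, and step 3 is just deleting the old registration once no import machine still makes it.

  from xmlrpc.server import SimpleXMLRPCServer

  def get_import_job(machine_name):               # old protocol; removed in step 3
      return {"job_id": 42}

  def get_import_job_v2(machine_name, features):  # new protocol; added in step 1
      return {"job_id": 42, "features": features}

  server = SimpleXMLRPCServer(("localhost", 8097), allow_none=True)
  server.register_function(get_import_job)        # keep answering old clients
  server.register_function(get_import_job_v2)     # step 2 moves clients here
  server.serve_forever()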

I think the issues with the jobs system in general are similar, although for every other job type there's just one machine that runs that type of job, and the other job implementations talk to the database directly.

For the buildd-manager, I think it's actually fairly easy -- the manager itself is basically stateless, so assuming it has a way to exit cleanly after a scan, I think:

 install code for new buildd-manager
 shut down old manager
 start manager from new code
 remove old code

will be a low-impact event.  You should check this with Julian though :-)
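
For what it's worth, here's a sketch of how that sequence could be made close to a symlink flip (the paths, revision names and restart command are all invented), so the only window with no manager running is the restart itself:

  import os
  import subprocess

  DEPLOY_ROOT = "/srv/buildd-manager"   # invented layout: one dir per revision

  def deploy(new_rev):
      new_tree = os.path.join(DEPLOY_ROOT, new_rev)
      # ... step 1: rsync/branch the new code into new_tree here ...
      tmp_link = os.path.join(DEPLOY_ROOT, "current.new")
      os.symlink(new_tree, tmp_link)
      old_tree = os.path.realpath(os.path.join(DEPLOY_ROOT, "current"))
      os.rename(tmp_link, os.path.join(DEPLOY_ROOT, "current"))  # atomic flip
      # steps 2+3: stop the old manager, start the new one from "current"
      subprocess.check_call(["service", "buildd-manager", "restart"])
      # step 4: remove old_tree once the new manager is scanning happily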

> This effort ties into performance improvements as an enabler: the more
> quickly we can deploy improvements, the faster we can react to timeout
> issues, and thus the lower we can safely make the timeouts without
> causing extended downtime for users. Its all about cycle time :)

Indeed. Fixing rollouts that involve database upgrades will be harder, I expect! First things first and all that, though.

Cheers,
mwh

[1] I think this is already done for the librarian?

[2] Several recent changes -- using bzr-svn, bzr-git performance improvements, incremental imports, the increasing scarcity of new requests for CVS imports -- have combined to make the import jobs that take multiple hours or days much, much rarer.


