
launchpad-dev team mailing list archive

Re: rollout changes to reduce downtime - linked to 'release features when they are ready'

 

On 20/07/10 08:08, Robert Collins wrote:
> One of the ramifications of *all* of the proposed 'release features
> when they are ready' workflows is more production rollouts. As such I
> went over the proposed plan with James Troup looking for holes - we
> can't increase downtime - we spend too much time down already :)

As a general question, do you intend to change the 'all machines are running the same version of the code' idea, which is sort of vaguely nearly true at the moment? I guess downtime-free upgrades basically imply it not being true, so we should be starting to design towards that...

> As a result I've filed a few RT's to get redundant instances (probably
> on the same servers) of things like the xmlrpc server, codehosting
> server etc, so we can use one instance live and upgrade the other in
> parallel.

Yay. Out of curiosity, what's the plan for doing the switcheroo for codehosting? Using a TCP-level load balancer/port forwarder sounds easiest to me (and most useful, since it would also let us split the bzr+ssh processes off from the code host, something we should be thinking about in the medium term).
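
To make the TCP-level idea concrete, here's a purely illustrative forwarder sketch (the port numbers are made up, and in reality you'd reach for haproxy or similar rather than hand-rolling one). The point is only that the forwarder never looks inside the SSH stream, so which backend instance is currently live is invisible to clients:

  import socket
  import threading

  BACKEND = ("localhost", 5023)   # hypothetical "live" codehosting instance

  def pipe(src, dst):
      # Shovel bytes one way until EOF; no protocol awareness at all.
      while True:
          data = src.recv(65536)
          if not data:
              break
          dst.sendall(data)
      try:
          dst.shutdown(socket.SHUT_WR)   # tell the far side we're done writing
      except OSError:
          pass

  def handle(client):
      backend = socket.create_connection(BACKEND)
      threading.Thread(target=pipe, args=(client, backend)).start()
      threading.Thread(target=pipe, args=(backend, client)).start()

  listener = socket.socket()
  listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  listener.bind(("0.0.0.0", 5022))
  listener.listen(5)
  while True:
      # Repoint BACKEND (or the balancer's config) to move new connections to
      # the upgraded instance; existing connections drain off the old one.
      client, _ = listener.accept()
      handle(client)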

> There are three key things that are not yet prepped for highly
> available rollouts:
>   - cronscripts (probably including the job system)
>   - buildd master/slaves
>   - importds

> I've filed a bug for the cronscripts as a whole and for the buildd's -
> I had the temerity to mark these as high since we're going to be
> impacting the ability for us to increase our velocity safely until
> those are fixed.

> I don't know enough about the job system or the importd system to
> sensibly talk about highly available upgrades there yet. I'd love it
> if someone were to just file bugs / RT's as appropriate to get such a
> process in place - but failing that, I hope to discuss them with
> whomever knows most in the next day or two.

In some sense, the importd system doesn't seem super high priority as the users don't interact with it directly. Currently, import jobs tend to die during rollouts, which is certainly a bit inefficient but doesn't really have other consequences as the system is built to cope with jobs/import machines flaking out.

The importds only interact with the rest of Launchpad via the internal XML-RPC server and the librarian, so load-balancing those services to get downtime-free upgrades[1] would mean that upgrades to the rest of Launchpad could be done without impacting the code import system. Semi-obviously, the database being in read-only mode will tend to bring the system to a halt.
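
Just to spell that out (the URL and method name below are invented, not the real internal API): from the importd side, the rest of Launchpad is one XML-RPC endpoint plus the librarian, so as long as a single balanced address keeps answering, it doesn't matter which instance behind it is being upgraded at the time:

  import xmlrpc.client

  # Hypothetical balanced address and method name -- not the real internal
  # API.  The importd holds no state about which instance answers, so taking
  # one XML-RPC server down to upgrade it is invisible here.
  proxy = xmlrpc.client.ServerProxy("http://xmlrpc.lp.internal:8097/")
  job_id = proxy.get_job_for_machine("importd03")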

When it comes to updating code on the import machines themselves, I don't think the issue is very different from the issues you'll have with cronscripts on other machines. It might be a bit harder, though, because code import jobs can run for an hour or so[2], which means the fairly clear approach of:

 for each machine:
   stop machine accepting new jobs
   wait for currently running jobs to finish
   upgrade code on machine
   allow machine to accept new jobs

would probably take a prohibitive amount of time for code imports.
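
Written out (with entirely made-up helper methods -- nothing here exists in the tree), the loop above looks something like this, and it's the inner wait that makes it prohibitive:

  import time

  def drain_and_upgrade(machines):
      # Made-up helpers; this is just the loop above written out.
      for machine in machines:
          machine.stop_accepting_jobs()     # take it out of the rotation
          while machine.running_jobs():     # this is the painful part: a job
              time.sleep(60)                # can hold the machine for an hour+
          machine.upgrade_code()            # e.g. pull/rsync the new tree
          machine.start_accepting_jobs()    # back into the rotation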

An approach where you installed the new code at a new path and didn't delete the old code until all jobs running from that tree had finished would work fine. I don't know how you tell when all jobs running from a particular tree have finished, though.
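
One way you might tell, sketched with psutil and under the assumption that a job's worker process keeps its working directory inside the tree it was started from (I'm not claiming that's how the importds actually behave): declare the old tree quiescent once no live process has its cwd under it.

  import psutil

  def tree_is_quiescent(old_tree_path):
      # ad_value=None just skips processes we aren't allowed to inspect.
      # Only a heuristic: a worker that chdir()s elsewhere would be missed.
      for proc in psutil.process_iter(attrs=["cwd"], ad_value=None):
          cwd = proc.info["cwd"]
          if cwd and cwd.startswith(old_tree_path):
              return False
      return True

  # e.g. only delete the old tree once tree_is_quiescent("/srv/importd/r9876")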

Just upgrading the code under running jobs is probably low-risk but the idea does make me a bit uneasy.

Changes to the protocol by which the code import machines talk to the rest of Launchpad would require a three-step rollout (1: roll out the addition of the new protocol; 2: roll out code that uses the new protocol; 3: roll out the removal of the old protocol), but I think that's just life and something we can cope with.
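
As a sketch of what step 1 means on the server side (the method names are invented, not the actual scheduler API): the new call is registered alongside the old one, both are answered through steps 1 and 2, and step 3 is just deleting the old registration once no import machine still makes it.

  from xmlrpc.server import SimpleXMLRPCServer

  def get_import_job(machine_name):               # old protocol; removed in step 3
      return {"job_id": 42}

  def get_import_job_v2(machine_name, features):  # new protocol; added in step 1
      return {"job_id": 42, "features": features}

  server = SimpleXMLRPCServer(("localhost", 8097), allow_none=True)
  server.register_function(get_import_job)        # keep answering old clients
  server.register_function(get_import_job_v2)     # step 2 moves clients here
  server.serve_forever()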

I think the issues with the jobs system in general are similar, although for every other job type there's just one machine that runs that type of job, and the other job implementations talk to the database directly.

For the buildd-manager, I think it's actually fairly easy -- the manager itself is basically stateless, so assuming it has a way to exit cleanly after a scan, I think:

 install code for new buildd-manager
 shut down old manager
 start manager from new code
 remove old code

will be a low-impact event.  You should check this with Julian though :-)
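
For what it's worth, here's a sketch of how that sequence could be made close to a symlink flip (the paths, revision names and restart command are all invented), so the only window with no manager running is the restart itself:

  import os
  import subprocess

  DEPLOY_ROOT = "/srv/buildd-manager"   # invented layout: one dir per revision

  def deploy(new_rev):
      new_tree = os.path.join(DEPLOY_ROOT, new_rev)
      # ... step 1: rsync/branch the new code into new_tree here ...
      tmp_link = os.path.join(DEPLOY_ROOT, "current.new")
      os.symlink(new_tree, tmp_link)
      old_tree = os.path.realpath(os.path.join(DEPLOY_ROOT, "current"))
      os.rename(tmp_link, os.path.join(DEPLOY_ROOT, "current"))  # atomic flip
      # steps 2+3: stop the old manager, start the new one from "current"
      subprocess.check_call(["service", "buildd-manager", "restart"])
      # step 4: remove old_tree once the new manager is scanning happily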

> This effort ties into performance improvements as an enabler: the more
> quickly we can deploy improvements, the faster we can react to timeout
> issues, and thus the lower we can safely make the timeouts without
> causing extended downtime for users. Its all about cycle time :)

Indeed. Fixing rollouts that involve database upgrades will be harder, I expect! First things first and all that, though.

Cheers,
mwh

[1] I think this is already done for the librarian?

[2] Several recent changes -- using bzr-svn, bzr-git performance improvements, incremental imports, the increasing scarcity of new requests for CVS imports -- have combined to make the import jobs that take multiple hours or days much, much rarer.


