Re: rollout changes to reduce downtime - linked to 'release features when they are ready'
On 20/07/10 08:08, Robert Collins wrote:
> One of the ramifications of *all* of the proposed 'release features
> when they are ready' workflows is more production rollouts. As such I
> went over the proposed plan with James Troup looking for holes - we
> can't increase downtime - we spend too much time down already :)
As a general question, do you intend to change the 'all machines are
running the same version of the code' idea, which is sort of vaguely
nearly true at the moment? I guess downtime-free upgrades basically
imply it not being true, so we should be starting to design towards that...
> As a result I've filed a few RTs to get redundant instances (probably
> on the same servers) of things like the xmlrpc server, codehosting
> server etc, so we can use one instance live and upgrade the other in
> parallel.
Yay. Out of curiosity, what's the plan for doing the switcheroo for
codehosting? Using a TCP-level load balancer/port forwarder sounds
easiest to me (and would be the most useful, as it would allow splitting
off the bzr+ssh processes from the code host, something we should be
thinking about in the medium term).
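To make that concrete, here's roughly the shape of forwarder I mean,
sketched with Twisted's stock portforward module (the ports are made up,
and this isn't anything we run today):

  from twisted.internet import reactor
  from twisted.protocols import portforward

  # Clients connect to port 2222; bytes are relayed to whichever backend
  # the factory currently points at (instance A here, on an invented port).
  factory = portforward.ProxyFactory("localhost", 5022)
  reactor.listenTCP(2222, factory)

  # Doing the switcheroo would mean repointing the factory at the freshly
  # upgraded instance (however that ends up being triggered); connections
  # that are already established keep talking to the old instance.
  #   factory.host, factory.port = "localhost", 5023
  reactor.run()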
> There are three key things that are not yet prepped for highly
> available rollouts:
>  - cronscripts (probably including the job system)
>  - buildd master/slaves
>  - importds
> I've filed a bug for the cronscripts as a whole and for the buildds -
> I had the temerity to mark these as high since we're going to be
> impacting our ability to increase our velocity safely until those are
> fixed.
> I don't know enough about the job system or the importd system to
> sensibly talk about highly available upgrades there yet. I'd love it
> if someone were to just file bugs / RTs as appropriate to get such a
> process in place - but failing that, I hope to discuss them with
> whomever knows most in the next day or two.
In some sense, the importd system doesn't seem super high priority as
the users don't interact with it directly. Currently, import jobs tend
to die during rollouts, which is certainly a bit inefficient but doesn't
really have other consequences as the system is built to cope with
jobs/import machines flaking out.
The importds only interact with the rest of Launchpad via the internal
XML-RPC server and the librarian, so load-balancing those services to
get downtime-free upgrades[1] would mean that upgrades to the rest of
Launchpad could be done without impacting the code import system.
Semi-obviously, the database being in read-only mode will tend to bring
the system to a halt.
When it comes to updating code on the import machines themselves, I
don't think the issue is very different to the issues you'll have with
cronscripts on other machines. It might be a bit harder because code
import jobs can run for an hour or so[2], so the fairly clear approach of:
    for each machine:
        stop machine accepting new jobs
        wait for currently running jobs to finish
        upgrade code on machine
        allow machine to accept new jobs
would probably take a prohibitive amount of time for code imports.
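(For machines where draining is practical, I imagine that loop looking
something like the following - the hostnames, flag file, process name
and upgrade command are all invented:)

  import subprocess
  import time

  MACHINES = ["cronhost1", "cronhost2"]  # invented names

  def running_jobs(machine):
      # Count worker processes still running on the host; 'job-worker'
      # stands in for whatever the real processes are called.
      result = subprocess.run(
          ["ssh", machine, "pgrep", "-c", "-f", "job-worker"],
          capture_output=True, text=True)
      return int(result.stdout.strip() or 0)

  for machine in MACHINES:
      # Stop the machine picking up new jobs (a flag file, say).
      subprocess.check_call(["ssh", machine, "touch", "/srv/launchpad/STOP"])
      # Wait for the jobs it is already running to finish.
      while running_jobs(machine) > 0:
          time.sleep(60)
      # Upgrade the code and let it pick up jobs again.
      subprocess.check_call(["ssh", machine, "update-launchpad-code"])
      subprocess.check_call(["ssh", machine, "rm", "/srv/launchpad/STOP"])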
An approach where you installed the new code at a new path and didn't
delete the old code until all jobs running from that tree finished would
work fine. I don't know how you'd tell when all the jobs running from a
particular tree have finished, though.
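One way it might work, assuming the workers run with their working
directory inside the tree they were started from (the layout and the
/proc poking are just assumptions for the sake of the sketch):

  import os
  import shutil

  ROLLOUTS = "/srv/importd/rollouts"        # hypothetical layout
  CURRENT = os.path.join(ROLLOUTS, "current")

  def trees_in_use():
      """Return the rollout trees some running process still has as cwd."""
      in_use = set()
      for pid in filter(str.isdigit, os.listdir("/proc")):
          try:
              cwd = os.readlink("/proc/%s/cwd" % pid)
          except OSError:
              continue
          if cwd.startswith(ROLLOUTS + "/"):
              in_use.add(os.path.relpath(cwd, ROLLOUTS).split(os.sep)[0])
      return in_use

  def switch_to(new_tree):
      """Point 'current' at new_tree; jobs started after this use it."""
      tmp = CURRENT + ".new"
      os.symlink(os.path.join(ROLLOUTS, new_tree), tmp)
      os.rename(tmp, CURRENT)               # atomic flip of the symlink

  def prune_old_trees():
      """Delete rollout trees that are neither current nor still in use."""
      current = os.path.basename(os.readlink(CURRENT))
      for tree in os.listdir(ROLLOUTS):
          if tree.startswith("current") or tree == current:
              continue
          if tree in trees_in_use():
              continue
          shutil.rmtree(os.path.join(ROLLOUTS, tree))

Recording the tree each job was started from somewhere more durable than
/proc would obviously be less fragile, but the shape is the same.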
Just upgrading the code under running jobs is probably low-risk but the
idea does make me a bit uneasy.
Changes to the protocol by which code import machines talk to the rest
of Launchpad would require a three-step rollout (1: roll out addition of
the new protocol, 2: roll out code to use the new protocol, 3: roll out
removal of the old protocol), but I think that's just life and something
we can cope with.
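For example, step 1 might look like this on the server side, sketched
with the stdlib XML-RPC server rather than our actual internal endpoint,
and with invented method names: the new method appears alongside the old
one, so import machines running either version of the worker code keep
working until step 3 removes the old method.

  from xmlrpc.server import SimpleXMLRPCServer

  class CodeImportAPI:
      def getJobForMachine(self, hostname):
          # Old protocol: still served until step 3 removes it.
          return {"job_id": 42}

      def getJobForMachine2(self, hostname, worker_limit):
          # New protocol: added in step 1; workers switch to it in step 2.
          return {"job_id": 42, "worker_limit": worker_limit}

  server = SimpleXMLRPCServer(("localhost", 8087), allow_none=True)
  server.register_instance(CodeImportAPI())
  server.serve_forever()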
I think the issues with the jobs system in general are similar, although
for every other job type there's just one machine that runs that type of
job, and the other job implementations talk to the database directly.
For the buildd-manager, I think it's actually fairly easy -- the manager
itself is basically stateless, so assuming it has a way to exit cleanly
after a scan, I think:
    install code for new buildd-manager
    shut down old manager
    start manager from new code
    remove old code
will be a low-impact event. You should check this with Julian though :-)
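By "a way to exit cleanly after a scan" I mean something with this shape
(not the real manager loop, just an illustration): a signal handler sets
a flag that is only checked between scans, so whatever scan is in
progress finishes before the process exits.

  import signal
  import time

  shutdown_requested = False

  def request_shutdown(signum, frame):
      global shutdown_requested
      shutdown_requested = True

  signal.signal(signal.SIGTERM, request_shutdown)

  def scan_builders():
      # Stand-in for the real scan: dispatch jobs, collect results, etc.
      time.sleep(5)

  while not shutdown_requested:
      scan_builders()
      # The flag is only consulted here, between scans.

  print("Scan finished; exiting so the upgraded manager can take over.")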
> This effort ties into performance improvements as an enabler: the more
> quickly we can deploy improvements, the faster we can react to timeout
> issues, and thus the lower we can safely make the timeouts without
> causing extended downtime for users. It's all about cycle time :)
Indeed. Fixing rollouts that involve database upgrades will be harder, I
expect! First things first and all that, though.
Cheers,
mwh
[1] I think this is already done for the librarian?
[2] Several recent changes -- using bzr-svn, bzr-git performance
improvements, incremental imports, the increasing scarcity of new
requests for CVS imports -- have combined to make the import jobs that
take multiple hours or days much much rarer.