
launchpad-dev team mailing list archive

Re: rollout changes to reduce downtime - linked to 'release features when they are ready'

 

On 20/07/10 16:39, Robert Collins wrote:
On Tue, Jul 20, 2010 at 1:20 AM, Michael Hudson
<michael.hudson@xxxxxxxxxxxxx>  wrote:
On 20/07/10 08:08, Robert Collins wrote:

One of the ramifications of *all* of the proposed 'release features
when they are ready' workflows is more production rollouts. As such I
went over the proposed plan with James Troup looking for holes - we
can't increase downtime - we spend too much time down already :)

As a general question, do you intend to change the 'all machines are running
the same version of the code' idea, which is sort of vaguely nearly true at
the moment?  I guess downtime-free upgrades basically imply it not being
true, so we should be starting to design towards that...

It's not entirely honoured at the moment

Right, that's what I meant by 'vaguely nearly true' :-)

- we stagger appserver
upgrades, and also edge runs different code to production

Also, cherry picks (and sometimes post-release re-rolls) tend to be applied to only a limited number of machines. I guess with the new workflow cherry picks will be a thing of the past -- this makes me happy :-)

(which makes
it pretty hard to determine if 'error X' is due to the user
population, or the code base). So the overall goal is:
  - 1 deployed codebase
  - rev the deployed version when we've QA'd some more changes

The actual deploy-a-version process needs to continue to stagger
things, but it may - for some services - have to stagger on a period
of hours rather than minutes.

I would say that a single rollout is not *complete* until we're
running a single rev across the board

That sounds like a good definition.

- which means getting to the
point of being able to do graceful upgrades of the importds, buildds,
codehosting, jobs system - all the things that do multi-minute
operations.


As a result I've filed a few RT's to get redundant instances (probably
on the same servers) of things like the xmlrpc server, codehosting
server etc, so we can use one instance live and upgrade the other in
parallel.

Yay.  Out of curiosity, what's the plan for doing the switcheroo for
codehosting?  Using a tcp-level load balancer/port forwarder sounds easiest
to me (and most useful to allow splitting off the bzr+ssh processes from the
code host, something we should be thinking about in the medium term).

Ask James :) Something like that is my understanding,

OK.

though the
bzr+ssh split off seems unrelated?

Yes, probably.  Certainly unrelated to this thread.

In some sense, the importd system doesn't seem super high priority as the
users don't interact with it directly.  Currently, import jobs tend to die
during rollouts, which is certainly a bit inefficient but doesn't really
have other consequences as the system is built to cope with jobs/import
machines flaking out.

Ok, so the answer may be 'we interrupt those jobs when we're ready'?

Yes, that's probably reasonable for the import case.

The importds only interact with the rest of Launchpad via the internal
xml-rpc server and librarian, so load-balancing those services to get
downtime-free upgrades[1] would mean that upgrades to the rest of Launchpad
could be done without impacting the code import system. Semi-obviously,
the database being in read-only mode will tend to bring the system to
a halt.

Right. Read-only mode *is downtime*, and while it's a necessary
facility from time to time, we should only invoke it when we need to.

OK.

When it comes to updating code on the import machines themselves, I don't
think the issue is very different to the issues you'll have with cronscripts
on other machines.  It might be a bit harder because code import jobs can
run for an hour or so[2], so the fairly clear approach of:

  for each machine:
      stop machine accepting new jobs
      wait for currently running jobs to finish
      upgrade code on machine
      allow machine to accept new jobs

would probably take a prohibitive amount of time for code imports.

Hours we could do if we had to, I think - automation and a dashboard
FTW. Days we can't, and I'd rather not do hours as a general case
anyhow.

OK. I don't think it's reasonable to expect the above pseudocode to take less than an hour.
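
For concreteness, that loop in Python might look something like the
sketch below; the per-machine hooks are invented stand-ins for whatever
deployment tooling would actually do the work, not existing Launchpad
code.

  import time

  def drain_and_upgrade(machines, stop_accepting, running_jobs,
                        upgrade, start_accepting):
      """Upgrade machines one at a time, draining each one first.

      All four hooks are hypothetical stand-ins for real deployment
      tooling; they are not existing Launchpad APIs.
      """
      for machine in machines:
          stop_accepting(machine)        # stop machine accepting new jobs
          while running_jobs(machine):   # wait for running jobs to finish
              time.sleep(60)             # imports can run for an hour or so
          upgrade(machine)               # upgrade code on machine
          start_accepting(machine)       # allow machine to accept new jobs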

An approach where you installed the new code at a new path and didn't delete
the old code until all jobs running from that tree finished would work fine.
I don't know how you tell when all jobs running from a particular tree are
finished though.
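
To make the install-beside half concrete, something like the sketch
below is what I have in mind; the layout is invented, and
jobs_running_from() is exactly the hook we don't have.

  import os
  import shutil

  DEPLOY_ROOT = '/srv/importd'   # invented layout, purely for illustration

  def switch_current(revision):
      """Point the 'current' symlink at a tree already unpacked under
      DEPLOY_ROOT/releases/<revision>; jobs started afterwards use it."""
      new_tree = os.path.join(DEPLOY_ROOT, 'releases', revision)
      tmp = os.path.join(DEPLOY_ROOT, 'current.new')
      os.symlink(new_tree, tmp)
      os.rename(tmp, os.path.join(DEPLOY_ROOT, 'current'))  # atomic flip

  def remove_idle_trees(jobs_running_from):
      """Delete old trees once nothing runs from them.  jobs_running_from
      is the hypothetical hook this paragraph is missing."""
      releases = os.path.join(DEPLOY_ROOT, 'releases')
      current = os.path.realpath(os.path.join(DEPLOY_ROOT, 'current'))
      for name in os.listdir(releases):
          tree = os.path.join(releases, name)
          if tree != current and not jobs_running_from(tree):
              shutil.rmtree(tree)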

Can we change the code to make that clear somehow?

I can't think of anything tasteful right now.  Do you have any ideas?

Just upgrading the code under running jobs is probably low-risk but the idea
does make me a bit uneasy.

Meep. No thanks ;)

Heh.  OK.

It occurs to me that the codehosting server has a slightly similar issue; you want to shut the old server down when its last connection closes. This is probably a bit easier though (the load balancer might be able to tell you, or you can change the state of the ssh server through some control socket).
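
Very roughly, the shape I mean is something like this toy sketch
(nothing to do with the real conch-based service, and the control
socket/load balancer part is hand-waved into a stop_event):

  import asyncio

  class DrainingServer:
      """Toy echo server that refuses new connections once told to stop,
      and only exits when its last existing connection has closed."""

      def __init__(self):
          self.active = 0
          self.idle = asyncio.Event()
          self.idle.set()                  # no connections yet

      async def handle(self, reader, writer):
          self.active += 1
          self.idle.clear()
          try:
              data = await reader.read(4096)
              while data:
                  writer.write(data)
                  await writer.drain()
                  data = await reader.read(4096)
          finally:
              writer.close()
              self.active -= 1
              if self.active == 0:
                  self.idle.set()

      async def serve_until_drained(self, stop_event, port=5022):
          server = await asyncio.start_server(self.handle, '127.0.0.1', port)
          await stop_event.wait()          # e.g. poked via a control socket
          server.close()                   # stop accepting new connections
          await server.wait_closed()
          await self.idle.wait()           # last connection has now closed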

Changes to the protocol by which code import machines talk to the rest of
launchpad would require a three step rollout (1: roll out addition of new
protocol, 2: roll out code to use new protocol, 3: roll out removal of old
protocol), but I think that's just life and something we can cope with.
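
As a sketch of what step 1 could look like: the internal server
registers both the old and the new method until every importd has
moved over, and step 3 then drops the old one. The method names, port
and return values here are invented, not the real API.

  from xmlrpc.server import SimpleXMLRPCServer

  def get_import_job(machine_name):
      """Old-style call; kept around until step 3, when no importd uses it."""
      return {'job_id': 42, 'branch_url': 'http://example.com/foo'}

  def get_import_job_v2(machine_name, supported_rcs_types):
      """New-style call used by importds running the upgraded code."""
      return {'job_id': 42, 'branch_url': 'http://example.com/foo',
              'rcs_type': supported_rcs_types[0]}

  server = SimpleXMLRPCServer(('localhost', 8087), allow_none=True)
  server.register_function(get_import_job)      # old protocol, kept in step 1
  server.register_function(get_import_job_v2)   # new protocol, used in step 2
  server.serve_forever()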

Yes, that's exactly the point of this - to enable that sort of staged
process without waiting many weeks to go through each step - we should
be able to do such a transition in one day, so that we don't have
kludgy transitional code hanging around for extended periods.

Sounds good to me.

I think the issues with the jobs system in general are similar, although for
every other job type there's just one machine that runs that type of job,
and the other job implementations talk to the database directly.

For the buildd-manager, I think it's actually fairly easy -- the manager
itself is basically stateless, so assuming it has a way to exit cleanly
after a scan, I think:

  install code for new buildd-manager
  shut down old manager
  start manager from new code
  remove old code

will be a low impact event.  You should check this with Julian though :-)

My understanding from James Troup is that the slaves go boom when the
tcp socket closes - I've filed a bug about this though.

I find this a bit tricky to believe. The manager talks xml-rpc to the slaves, so there should be no persistent connection in general (even if we're using pipelining by some perverse miracle, it shouldn't matter if the socket closes). I can believe that losing the manager at an arbitrary time would be bad, but exiting between scans should be fine.
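
A sketch of what exiting between scans could look like: a signal
handler just sets a flag and the loop checks it after each scan, so
the manager never disappears mid-scan. scan_builders and the interval
are placeholders, not the real buildd-manager code.

  import signal
  import time

  shutdown_requested = False

  def request_shutdown(signum, frame):
      """Note the request; the scan in progress is allowed to finish."""
      global shutdown_requested
      shutdown_requested = True

  def run_manager(scan_builders, interval=15):
      """Scan builders in a loop, exiting cleanly between scans on SIGTERM."""
      signal.signal(signal.SIGTERM, request_shutdown)
      while not shutdown_requested:
          scan_builders()       # xml-rpc calls to the slaves; no state kept
          time.sleep(interval)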

Thanks for the feedback, it's excellent to know a bit more about how
things are actively deployed. It sounds like there might be a code
change needed to make code transitions on the importds easier to
manage - perhaps you could file that?

Let's have one more round of waffle first ;-)

Cheers,
mwh


