
launchpad-dev team mailing list archive

performance tuesday: fast downtime is go!

 

I try to write about application performance in these emails, but
today I'm going to exercise some editorial leeway and instead write
about development performance - specifically about the latency by
which schema changes get deployed to the production environment.

This mail is a little long, and there is a very important question in
it, so the tl;dr version first:

 - if you know of a script of ours that requires manual fixup after
its database connection is interrupted, please let Stuart or me know.

 - For *all* new DB patches, please aim for a 10-15 second *total*
application time on staging/qastaging. For assistance achieving that,
you're welcome to tap Stuart or me. As of this week, slow patches (> 15
seconds) will require signoff by me or Francis, as they will cause
excessive downtime.

 - *no* more changes to both DB code and Python code at the same time:
This applies to both devel and db-devel.

 - There will be some disruption to staging.launchpad.net as this
project is worked on and tested.
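
The 10-15 second budget above is easy to enforce mechanically when
applying a patch on staging. A hypothetical sketch (the command and
the exact guard are illustrative, not our actual patch tooling):

```python
import subprocess

# Hypothetical guard: run a patch-application command and fail loudly
# if it exceeds the downtime budget. The command passed in is a
# placeholder for whatever actually applies the patch on staging.
def apply_within_budget(cmd, budget_seconds=15):
    try:
        subprocess.run(cmd, check=True, timeout=budget_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Patch blew the budget: it needs signoff or rework.
        return False
```

A patch that fails this check on staging is exactly the kind that
needs a signoff before it goes anywhere near production.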



Now, for the actual content :)... It's likely that everyone who cares
about this already knows our current process, but I'm going to
summarise it here anyway for clarity.

The process today is:
 - we decide to make a change (e.g. to improve performance, add a
feature, whatever)
 - a patch is prepared (involves one or more devs, the dba and/or ta)
 - patch is reviewed and categorised as being apply-live or apply-cold
 - apply-live patches
(https://dev.launchpad.net/Database/LivePatching) land on
lp:launchpad/devel
   - these are then applied live by a sysadmin or dba, before the
matching code change can be deployed.
 - apply-cold patches land on lp:launchpad/db-devel
   - these then stall for 2 weeks on average, until the next monthly downtime

Monthly downtime involves:
 - shutting everything down/into read only mode
 - doing a deploy of new code to all the servers
 - breaking the read only replica
 - applying any pending db patches
 - starting up all the servers
 - zapping and rebuilding the read only replica

We have about an hour overhead in shutting down and starting up the
appservers, plus the actual db patch application itself. In addition
there is an hour before the downtime where we quiesce background tasks
like email sending, archive publishing and so on.

As a whole this process sits in an awkward position optimisation-wise:
because patch application occurs infrequently, lots of patches queue
up; because there are lots of patches, the non-application overhead is
large (though still the same order of magnitude as the application
itself); and while the non-application overhead is large, we cannot do
the downtime more frequently: it's a self-reinforcing situation.
Making it worse, a lot of the overhead we have exists to make the long
(up to 90 minutes) downtime tolerable, and is itself a frequent cause
of delays in the application process (for instance, breaking out the
read-only replica often crashes the replication software).

The aggregate impact of 2 week delays on work landing is pretty
significant, and so fixing this is a high priority for addressing our
development cycle time.

I handwaved a leaner approach a few weeks ago, discarding as much
overhead as possible to make the downtime as close to actual
application time as possible, and doing patches one at a time to
minimise batching effects. This is now covered in
https://dev.launchpad.net/LEP/FastDowntime. Stuart has been busy
translating my handwave into concrete possibilities.

This is deliberately -lean- - no frills, no bells, and potentially
very ugly. I expect that we can iterate rapidly once the basic
facility is in place.

Now, the new process is:
 - we decide to make a change (e.g. to improve performance, add a
feature, whatever)
 - a patch is prepared (involves one or more devs, the dba and/or ta)
 - patch is reviewed and categorised as being apply-live or apply-cold
 - apply-live patches
(https://dev.launchpad.net/Database/LivePatching) land on
lp:launchpad/devel
   - these are then applied live by a sysadmin or dba, before the
matching code change can be deployed.
 - apply-cold patches land on lp:launchpad/devel
   - these are then applied with a fast downtime process

The fast downtime involves:
 - ~1 hour before: quiescing background tasks
and then, at the time:
 - preventing new connections to the db except from the patcher
 - checking that no whitelisted DB users are in the middle of
transactions (and if there are, aborting)
 - kicking all connections off the db servers
 - applying the patches
 - allowing new connections back in
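
The whitelist check above is simple in outline. A hedged sketch of
the decision logic, with the connection records and user names purely
illustrative (the real thing would consult the database server's own
view of active connections, e.g. pg_stat_activity on PostgreSQL):

```python
# Sketch of the pre-patch connection check. Each connection is a
# (user, in_transaction) pair; the whitelist holds users whose scripts
# need manual recovery if interrupted. All names here are illustrative.
def plan_downtime(connections, whitelist):
    busy = [user for user, in_txn in connections
            if user in whitelist and in_txn]
    if busy:
        # A protected script is mid-transaction: abort the downtime.
        return ("abort", busy)
    # Otherwise it is safe to kick everyone off and apply the patch.
    return ("kick", [user for user, _ in connections])
```

The point of the abort path is that kicking an ordinary appserver
connection is cheap (it retries), while kicking a whitelisted script
costs a human some manual repair work afterwards.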

Now, if there are scripts that will need manual repair/recovery when
they are kicked out of the DB, we need to know, so that we can
whitelist their users. Normally they should be quiesced beforehand,
but if something goes wrong with that, the whitelist check is a
fallback step.

(To whitelist them, just reply to this thread with the script user.)

It will take a bit of time to get all the pieces together and working,
but I'm reasonably sure that we'll have it ready to roll before the
next monthly downtime window would have come around; so we'll not do
that window: instead, we'll pull patches that have accrued on db-devel
into devel one at a time using this new process.

Once things are mature enough, we can disable db-devel altogether, but
we're not there *yet*.

I'll be following up this mail with a bit less detailed one for
-users, and possibly a blog post.

Cheers,
Rob
