fastdowntime db deployment update

Hi.

Earlier today we did a successful test run of the fastdowntime
database deployment process.

At 2011-09-08 08:43 we shut down the soyuz systems.
At 2011-09-08 08:53:05 we entered 'downtime' and made a no-op update,
running just trusted.sql and resetting security.
At 2011-09-08 08:55:17 the outage completed. 2 minutes, 12 seconds.

All live systems failed during the outage as expected, generating an
OOPS storm. They all successfully recovered after the outage.

The trial tells us that the process seems solid enough, and that our
overhead for applying database patches is about 2 minutes and 12
seconds. The most significant optimization we can make to reduce this
overhead is to switch to Slony-I 2.0 (we are currently running 1.2,
but U1 is already running 2.0 happily).

We are scheduled for a real run 2011-09-09 08:30 UTC, applying our
backlog of database patches. The outage should be well within our 5
minute window. It is not yet known which parts of soyuz will be kept
live during the update, and which fragile parts will be shut down for a
longer period.

I think these are the main issues raised:

* The 'long running transaction' threshold was too low. Bug #844616

* There are rogue archive-publisher connections, likely from command
line tools being run interactively. These tools need to connect as
distinct database users so we can identify them and deal with them
appropriately, rather than simply refusing to kick off the process
because archive-publisher has been deemed a 'fragile user' (see the
sketch after this list for one way such connections could be surfaced).

* People hate being presented with an OOPS screen on the main
appserver during the outage. Bug #844631
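
On the first two items, here is a minimal sketch of how the deployment
script could surface both long-running transactions and connections from
users it does not expect, by polling pg_stat_activity before entering
downtime. The threshold, the user list, the DSN and the helper itself are
illustrative assumptions, not what the fastdowntime tooling actually does:

    # Sketch only -- assumes psycopg2 and PostgreSQL >= 9.2 (where the
    # backend id column is named pid rather than procpid); the threshold
    # and the 'fragile' user list are invented for illustration.
    import psycopg2

    FRAGILE_USERS = ['archive-publisher']   # users we refuse to interrupt
    MAX_TRANSACTION_AGE = '2 minutes'        # cf. the Bug #844616 threshold

    def blockers(conn):
        """Return (user, pid, transaction age) rows that should block a run."""
        cur = conn.cursor()
        cur.execute("""
            SELECT usename, pid, now() - xact_start AS xact_age
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
              AND (usename = ANY(%s)
                   OR now() - xact_start > %s::interval)
            ORDER BY xact_age DESC
            """, (FRAGILE_USERS, MAX_TRANSACTION_AGE))
        return cur.fetchall()

    if __name__ == '__main__':
        conn = psycopg2.connect('dbname=launchpad_prod')  # placeholder DSN
        for usename, pid, age in blockers(conn):
            print('%s (pid %s) has held a transaction open for %s'
                  % (usename, pid, age))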

I expect the next issue to be raised will be how to cope with the OOPS
storm in our reports. I think we need to inform the report generators
about planned outages and have them ignore OOPSes raised during those
windows, and instead generate a special report for each outage window
to confirm that systems are failing the way they should be failing.
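
As a strawman for that, the report generator could be handed a list of
announced outage windows and partition OOPSes on whether they fall
inside one. The window list, the shape of an OOPS record, and the
function names below are assumptions for illustration only:

    # Sketch only: 'oopses' is assumed to be an iterable of objects with
    # a .timestamp attribute (UTC), and OUTAGE_WINDOWS would come from
    # wherever fastdowntime deployments are announced.
    from datetime import datetime

    OUTAGE_WINDOWS = [
        # (start, end), e.g. this morning's trial run
        (datetime(2011, 9, 8, 8, 53, 5), datetime(2011, 9, 8, 8, 55, 17)),
    ]

    def in_outage(timestamp):
        return any(start <= timestamp <= end
                   for start, end in OUTAGE_WINDOWS)

    def partition_oopses(oopses):
        """Split OOPSes into (normal, during_outage) for separate reports."""
        normal, during_outage = [], []
        for oops in oopses:
            (during_outage if in_outage(oops.timestamp) else normal).append(oops)
        return normal, during_outage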

Thanks to everyone who helped get us this far, particularly all the
patch and buildbot wranglers who untangled the web of rollbacks and
rollbacks of rollbacks and helped get qa'd revisions of code onto
servers so we could finally do this.

-- 
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/

