launchpad-dev team mailing list archive

ratcheting up the frequency with which we do FDTs

 

So fast downtime has been pretty successful: reliable - no process
failures, one slony bug found, and one bad patch that required
scrambling afterwards (and lessons learnt by stub and me, so we'll
catch future glitches of that nature).

However, teams such as disclosure have found the one-patch-a-day limit
quite... limiting. Not on average - we've done ~1 patch a week - but
the patches cluster, and being able to do them more willy-nilly would
be good.

So, as some of you have been nagging for :P - I've raised doing more
frequent FDTs with the LP stakeholders - and had no objections.

The new schedule is:
~0200UTC for 10 seconds.
~1000UTC for 10 seconds.
~1800UTC for 10 seconds.
Starting on the first slot in Monday Asia-Pac business hours and
finishing up in the last slot that is within Friday Asia-Pac business
hours (to avoid weekend fallout from a bad patch). The existing
constraints from Ubuntu to avoid FDT during release process deadline
days are unchanged.

This schedule will give us a max of 30 seconds downtime per day, a
significant reduction from the current maximum of 5 minutes, and
similarly less than the effective downtime of 60-90 seconds that we
had with Slony.
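
For anyone who wants the arithmetic spelled out, here is a rough sketch
of where the 30 second daily maximum comes from. The names
(FDT_SLOTS_UTC, SLOT_SECONDS, max_daily_downtime) are made up for
illustration and are not part of any Launchpad tooling.

```python
from datetime import time

# Slot start times (UTC) and slot length, taken from the schedule above.
# These constant names are illustrative only.
FDT_SLOTS_UTC = (time(2, 0), time(10, 0), time(18, 0))
SLOT_SECONDS = 10


def max_daily_downtime(slots=FDT_SLOTS_UTC, slot_seconds=SLOT_SECONDS):
    """Worst case: every slot in the day is used for its full length."""
    return len(slots) * slot_seconds


print(max_daily_downtime())
```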

We have ~6 seconds of overhead in the new slony-free patch algorithm.
So to meet the new 10 second downtime duration we'll need to be
confident each patch will do <=4 seconds work. This means a much lower
heuristic threshold for accepting in-patch processing of data,
including:
 - creation of indices
 - population of table rows
 - verification of constraints
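
The budget above can be written down as a tiny check. This is only a
sketch of the arithmetic: the 6 second overhead is the approximate
figure quoted here, and the function names are hypothetical.

```python
SLOT_SECONDS = 10      # downtime allowed per slot
OVERHEAD_SECONDS = 6   # approximate fixed cost of the slony-free algorithm


def patch_work_budget(slot_seconds=SLOT_SECONDS,
                      overhead_seconds=OVERHEAD_SECONDS):
    """Seconds left for the patch itself once the overhead is paid."""
    return slot_seconds - overhead_seconds


def fits_slot(patch_seconds,
              slot_seconds=SLOT_SECONDS,
              overhead_seconds=OVERHEAD_SECONDS):
    """True if a patch taking patch_seconds of work fits inside one slot."""
    return patch_seconds <= patch_work_budget(slot_seconds, overhead_seconds)
```

With the figures above, patch_work_budget() gives 4, so a patch measured
at 4.5 seconds of work would not fit.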

As before, if the patch can complete in the time budget (4 seconds)
while doing such work on staging, it is ok to do it. My current guess
is that doing a batch update or index on some hundreds of rows will be
ok, but that anything beyond 500 or so will be risky.
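
One way to picture the staging check is a small timing wrapper like the
one below. It is a sketch only: apply_patch stands in for whatever
actually runs the patch against staging (a hypothetical callable, not
an existing Launchpad function), and the 4 second default is the budget
described above.

```python
import time

PATCH_BUDGET_SECONDS = 4.0  # work budget per patch, from the text above


def time_patch(apply_patch, budget_seconds=PATCH_BUDGET_SECONDS):
    """Run apply_patch once and report (elapsed_seconds, within_budget).

    apply_patch is a placeholder for the real staging run.
    """
    start = time.perf_counter()
    apply_patch()
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= budget_seconds


# Example with a stand-in patch that does nothing.
elapsed, ok = time_patch(lambda: None)
print(f"patch took {elapsed:.3f}s, within budget: {ok}")
```

A single run on staging is only an estimate; timings there may differ
from production, which is why the row counts above are a guess rather
than a rule.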

I will update our wiki pages etc tomorrow, but this new schedule - and
the tighter QA rule of 4 seconds vs 15 seconds - takes effect
immediately.

As usual, this isn't set in stone, and I'll be delighted if there are
further improvements we can make (or problems that haven't been
identified).

Cheers,
Rob

