launchpad-dev team mailing list archive

Re: The future of downtime for rollouts?

 

On Tue, Sep 14, 2010 at 9:55 PM, Jonathan Lange <jml@xxxxxxxxxxxxx> wrote:
> Hello,
>
> I've noticed that negotiating the downtime for Launchpad rollouts is
> becoming increasingly tricky.
>
> So I can be clear when asked,
>  * what's the downtime for rollout now?
>  * are we doing anything to reduce it?
>  * when are we expecting to have zero downtime for rollout?
>
> I'll put the answer on a wiki somewhere once the thread winds up.

I have a few thoughts here.

The current process, AIUI, goes like this:
 - the RM asks the LOSAs and stub for the needed downtime.
 - they estimate it via various arcane methods(*)
 - that is then used for the announcement.

Short term:
Perhaps it would be better to say:
'We have a 90-minute downtime window each release. Always 90 minutes,
and never more than that.'

Long term:
ReleaseFeaturesWhenTheyAreDone (RFWTAD) includes, as part of its
incremental rollout, no perceived downtime for *everything* except DB
schema patches. On schema patches we are crippled by the locking needs
of Slony.

Once RFWTAD -or- the first of my performance goals (5 seconds) is
reached, I'll have another kanban slot and will be putting database
agility in that slot.

U1 are researching Cassandra in depth. That team may be organising a
meeting with one of the core devs to get across it; if that goes ahead
I hope to attend (and I hope Stuart and perhaps Gary will too). Nothing
is actually arranged yet: I got this from a brief chat with Elliot this
morning. To be clear, I've got no particular solution in mind, only
these criteria: same or better write & read scaling; same or better
high availability; something compelling on data integrity; same or
better performance.

(*): Joking.

-Rob


