launchpad-dev team mailing list archive

Re: The future of downtime for rollouts?

To: launchpad-dev@xxxxxxxxxxxxxxxxxxx
From: "Francis J. Lacoste" <francis.lacoste@xxxxxxxxxxxxx>
Date: Wed, 29 Sep 2010 16:07:02 -0400
Cc: Robert Collins <robert.collins@xxxxxxxxxxxxx>
In-reply-to: <AANLkTimojBDgTWseM9X0SnYEJebteK4ot50LDu7x_a8P@mail.gmail.com>
Organization: Canonical Ltd.
User-agent: KMail/1.13.5 (Linux/2.6.32-25-generic; KDE/4.5.1; x86_64; ; )

I like the idea of a fixed downtime budget a lot.

What do we need to make it happen?

My thoughts:

  * A feedback mechanism allowing us to track how much of the budget is spent.   
    * Something like a script that extracts a measure from the staging update?
  * Once the budget is spent, no more DB changes.
  * We probably need a way to adjust the budget for cases where an OS
    maintenance task will shrink the DB budget.
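
A minimal sketch of what such a tracking script might look like, in Python. Apart from the 90-minute budget proposed in this thread, everything here is an assumption made for illustration: the names `DowntimeBudget` and `estimate_production_minutes`, the scaling factor, and the padding are hypothetical and are not taken from any existing Launchpad tooling.

```python
from dataclasses import dataclass


def estimate_production_minutes(staging_minutes, scale_factor, padding_minutes):
    """Scale a staging update time up to a production downtime estimate.

    scale_factor and padding_minutes are hypothetical inputs; real values
    would have to come from comparing past staging and production rollouts.
    """
    return staging_minutes * scale_factor + padding_minutes


@dataclass
class DowntimeBudget:
    total_minutes: float = 90.0     # the budget proposed in this thread
    reserved_minutes: float = 0.0   # time set aside, e.g. for OS maintenance
    spent_minutes: float = 0.0      # time already claimed by accepted DB changes

    def reserve(self, minutes):
        """Shrink the DB budget to make room for other maintenance work."""
        self.reserved_minutes += minutes

    def remaining(self):
        """Minutes still available for DB changes, never below zero."""
        return max(0.0, self.total_minutes - self.reserved_minutes - self.spent_minutes)

    def can_accept(self, estimated_minutes):
        """True if a change with this estimate still fits in the budget."""
        return estimated_minutes <= self.remaining()

    def accept(self, estimated_minutes):
        """Record a DB change, or refuse it once the budget is spent."""
        if not self.can_accept(estimated_minutes):
            raise ValueError("downtime budget exhausted; DB change rejected")
        self.spent_minutes += estimated_minutes


if __name__ == "__main__":
    budget = DowntimeBudget()
    budget.reserve(20.0)  # hypothetical OS maintenance window
    estimate = estimate_production_minutes(10.0, scale_factor=3.0, padding_minutes=5.0)
    if budget.can_accept(estimate):
        budget.accept(estimate)
    print("remaining minutes:", budget.remaining())
```

The estimate is deliberately a separate function, so that the staging-based measure could be swapped for a better one without touching the budget bookkeeping.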


-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx

On September 14, 2010, Robert Collins wrote:
> On Wed, Sep 15, 2010 at 12:46 AM, Curtis Hovey <curtis.hovey@xxxxxxxxxxxxx> wrote:
> > On Tue, 2010-09-14 at 11:41 +0100, Tom Haddon wrote:
> >> Might be more reliable but less accurate :) We estimate the downtime
> >> based on how long the last update took on staging, and then multiplying
> >> by a factor that seems to have accurately reflected the difference in
> >> time between staging and production (with a little padding). We could
> >> only commit to 90 mins if we refused to rollout any DB updates that took
> >> longer than a certain period of time on staging.
> > 
> > Staging restore times trend up, so we are always talking about
> > increasing time for a rollout. We will continue to do schema development
> > after the featureflag is complete. What we cannot see is the staging
> > restore time versus the real time--maybe that is pointless because there
> > are other rollout incidents that increased the rollout.
> 
> Staging restore times as a whole are a poor surrogate as already discussed.
> 
> The point I am making is that unless we decide *how much downtime we
> will tolerate*, we'll always have reasons to do more.
> 
> So I'm proposing:
>  - a 90m budget.
>  - if we can't do it in that timeframe, we don't do it.
> 
> We *will have to* innovate and address various issues to stick to
> this, but 90m of time is actually a lot of lost time to the many
> thousands of users we have in every timezone. We could spend a week
> with the whole team working on something to make the upgrade faster,
> and still be spending less time than our users are losing, when we're
> down.
> 
> -Rob

