launchpad-dev team mailing list archive

operational excellence: handling of deployment friction and failing backend services (including cronscripts)

To: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
From: Robert Collins <robert.collins@xxxxxxxxxxxxx>
Date: Wed, 19 Jan 2011 13:13:37 +1300
Sender: robertc@xxxxxxxxxxxxxxxxx

Just had a few interesting conversations which have led to a slight
tweak to the importance of some issues; I'm not sure whether these are
in written policies or not.

As background, Francis wants to achieve a 45 calendar day turnaround
on new requests. This is significantly constrained by db deployments.
That is, if a feature needs an additional db patch after users get
some experience with it (and we commonly need this) then our shortest
path is avg-time-to-db-deploy * 2 + time to polish. To rectify this we
should *aim* to be able to do db deployments as soon as we have a qa'd
db change.
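
As a rough illustration of the arithmetic above, here is a minimal
Python sketch. The function name and every figure except the 45-day
target are hypothetical, chosen only to show how the average wait for a
db deployment dominates the turnaround when a feature needs two db
patches.

```python
TARGET_DAYS = 45  # the calendar-day turnaround goal mentioned above


def shortest_turnaround_days(avg_days_to_db_deploy, polish_days,
                             db_deploys_needed=2):
    """Shortest path: avg-time-to-db-deploy * 2 + time to polish."""
    return avg_days_to_db_deploy * db_deploys_needed + polish_days


# Hypothetical figures, for illustration only.
monthly = shortest_turnaround_days(avg_days_to_db_deploy=30, polish_days=10)
weekly = shortest_turnaround_days(avg_days_to_db_deploy=7, polish_days=10)

print(monthly, monthly <= TARGET_DAYS)  # 70 False
print(weekly, weekly <= TARGET_DAYS)    # 24 True
```

With these assumed numbers, a 30-day average wait cannot meet the
target no matter how little polish time is needed, while a 7-day
average wait leaves plenty of room.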

Thus, the proposal is:
 - deployment friction bugs are very important - possibly even
critical. (Use judgement of course).

The long-term goal is for deployments to have no lead-in 'semi-down'
time, and for the site to be fully available immediately after the db
patch is applied. The actual downtime should be precisely the db
application time - e.g. we should be down for only a few minutes each
time.

Separately, failing backend services - e.g. update-karma, which isn't
running at the moment - may not log an OOPS, but we (Francis and I)
consider them equivalent to an OOPS in severity, so these should be
critical too. If they could log an OOPS, they would.
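
One way to picture "equivalent to an OOPS" is a wrapper that reports
any failure of a backend job at critical severity. The sketch below is
an assumption for illustration, not Launchpad's actual cronscript or
OOPS machinery: run_backend_job and the logger name are hypothetical,
and only the standard library is used. It covers jobs that start and
then fail; a job that never starts at all would still need outside
monitoring.

```python
import logging
import traceback

log = logging.getLogger("backend-services")  # hypothetical logger name


def run_backend_job(name, job, report=log.critical):
    """Run job(); report any exception at OOPS-equivalent severity.

    Returns True on success and False on failure.
    """
    try:
        job()
    except Exception:
        # Include the traceback so the report is as useful as an OOPS.
        report("backend service %s failed:\n%s"
               % (name, traceback.format_exc()))
        return False
    return True


def broken_update_karma():
    # Stand-in for a failing cronscript; the failure is simulated.
    raise RuntimeError("simulated failure")


reports = []
ok = run_backend_job("update-karma", broken_update_karma,
                     report=reports.append)
```

Here the report callable collects messages in a list so the behaviour
is easy to inspect; in practice it would feed whatever channel is
treated as critical.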

Cheers,
Rob