performance tuesday: pipelines and safe changes

 

This is another slightly meta topic, I'm afraid.

tl;dr / tl;wr:
 * QA other people's code.
 * Roll back immediately if something is unsafe to deploy. Do not wait
to fix it: the time it takes ec2 to validate your fix is too long.
 * Risky branches are considered harmful. Make them non-risky.
    (special note: lockstep changes ('all of <subject> must be
<verbed> at once') are considered harmful. Doing them is almost
certain to cost more than doing things incrementally.)

Our deployment process for a single revision is a pipeline:
5 minutes   PQM       code lands in trunk
5 hours     BUILDBOT  tested by buildbot -> fails, or
5 minutes   PQM       code lands in stable
20 minutes  DEPLOY    deployment -> fails, or
15 minutes  QATAGGER  ready for qa
???         HUMAN     qa -> fails and rollback [by inserting a
                      rollback at the top], or ready for deployment
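
For concreteness, a tiny sketch of the minimum traversal time (this is
just the figures above written down, not something our tooling runs):

    # A rough model of the pipeline above; the durations are the
    # estimates quoted in this mail, not measurements.
    STAGES_MINUTES = [
        ("pqm: land in trunk", 5),
        ("buildbot: test trunk", 5 * 60),
        ("pqm: land in stable", 5),
        ("deploy", 20),
        ("qatagger: mark ready for qa", 15),
    ]

    minimum_hours = sum(minutes for _, minutes in STAGES_MINUTES) / 60.0
    print(minimum_hours)  # 5.75 hours before a human can even start qa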

Every time a step in the pipeline fails in such a way that we have to
start over (e.g. landing a rollback, restarting buildbot), we have to
pay the entire cost of the pipeline through to the point humans can
make a qa assessment again. This pipeline is truncated - I don't
include ec2 (because it doesn't interact with other attempted
landings), and I don't include deployment (because once a rev *can* be
deployed [that is, all its predecessors are good too], it is through
the pipeline and unaffected by subsequent landings).

So every time we land a change we have an expected overhead of 5.75
hours if nothing goes wrong. This is increased by anything that might
go wrong - for instance, landing a branch that has a 50% chance of
failing in this pipeline raises the expected overhead: a 50% chance of
the usual 5.75 hours, and a 50% chance of (at least) 11.5 hours for a
second pass through the pipeline.
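
As a back-of-the-envelope check - optimistically assuming the retry
always succeeds, which it may not:

    # Expected overhead for one landing with a 50% failure rate,
    # assuming a single retry that always succeeds.
    pipeline_hours = 5.75
    p_fail = 0.5

    expected_hours = ((1 - p_fail) * pipeline_hours
                      + p_fail * (2 * pipeline_hours))
    print(expected_hours)  # 8.625 hours, versus 5.75 for a safe branch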

The key characteristic of this pipeline is that no item can complete
its path through the pipeline until the item before it has completed.

We land about 200 revisions a month - this has been pretty stable over
the last year - rev 11268 was one year back, and we're on 13558 now,
or 190 a month.

There are 20 work days in a month, so that is about 10 landings a day,
or 0.416 per hour averaged over the whole 24-hour day.

So *optimally* our system is going to have just over 2 revisions
entering the pipeline in the time it takes one revision to traverse it.
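
Roughly, putting the two figures together (a sketch, not a
measurement):

    # Landing rate versus pipeline latency, using the figures above.
    landings_per_hour = 200 / 20.0 / 24  # ~0.416 landings per hour
    pipeline_hours = 5.75

    print(landings_per_hour * pipeline_hours)  # ~2.4 revisions land
                                               # during one traversal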

Now, consider the impact of a failure at the front of the pipeline:
not only will we have to start over with a fix for the failure,
another 2 revisions will enter the pipeline while we do that.

If *either* of those fails, we have to fix them and start over before
we can use the fix for the very first one that started this.

As an example, say we have nothing in the pipeline at all, and we
start with rev A, which is broken.
Time 0:
  Rev A lands
  buildbot starts on rev A
  Rev B lands
  Rev C lands
  Rev A is on qastaging
Time 5.75:
  buildbot starts on rev C
  Rev A marked bad
  Rev D, a rollback for Rev A, lands
  Rev C is on qastaging
Time 11.5:
  buildbot starts on rev D
  Rev B marked bad
and so it goes, until we roll back *and* the new incoming revisions
have no failures of their own.

Sadly, I suspect this pattern will seem all too familiar to anyone who
has been doing deployment requests and looking at our deployment
reports.

So, with - on average - 2 new revisions entering the pipeline every
pass through the cycle, if our expected failure rate were 50% or
above, we would have at most a 25% chance (0.5 x 0.5, since both new
revisions must be clean) of stabilising and being ready for a deploy.
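
A quick sketch of that arithmetic (treating a cycle as deployable only
if every newly-arrived revision is good, and assuming failures are
independent - both simplifications):

    # Chance that one pass through the cycle leaves us deployable, if
    # each of the ~2 revisions arriving during it fails with
    # probability p_fail.
    def chance_of_stabilising(p_fail, new_revisions_per_cycle=2):
        return (1 - p_fail) ** new_revisions_per_cycle

    print(chance_of_stabilising(0.5))  # 0.25
    print(chance_of_stabilising(0.1))  # ~0.81 - low-risk branches keep
                                       # us deployable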

These are the independent variables we are dealing with:
 expected failure rate
 length of the pipeline

There is a dependent variable:
 # of unknown revisions between a known-bad revision and its fix
(whether that is a rollback or a fix, a.k.a. a roll-forward)


Changing the minimum length of the pipeline in a meaningful way
requires a massive improvement in test suite timings - something
people care about, but which isn't resourced *yet*.
Note however that the length of the pipeline extends indefinitely when
we have delays in QA.

So the only things we can control are the expected failure rate for
landings, and how long a possibly-bad revision waits before being
QA'd [and rolled back if needed].

Because the minimum pipeline length is nearly 6 hours, we *should
expect* that we cannot qa our own code except when we land it first
thing in our mornings... Depending on self-qa would make our pipeline
16 hours long (end of one day to the start of the next) at best.

Recently we've had a particularly hard time getting to a deployable
state, and I think that has been due to a higher-than-usual failure
rate for branches landed at and post-Epic.

We need to be quite sensitive to increased risk in branch landings, or
we get into this unstable state quite easily. The higher the risk of
failure, the greater the risk of a 5.75 hour stall.

Note that this isn't a 'work harder' problem: we can never be totally
sure about a branch; that is why we do QA.

Instead, this is a 'when deciding how to change something, avoid
choices that incur unnecessary risk' problem: what's necessary is an
engineer's choice.

Some examples that come to mind:
 - incompatible (internal or web) API changes: if a change breaks
stuff in your local branch, it may break stuff in other people's
branches, or *untested* stuff in your branch.
   - make the change compatibly: e.g. add a new attribute rather than
redefining the existing one (see the sketch after this list).
 - disk layout changes (e.g. where js files are compiled to, etc.)
   - check that merging the branch into an existing, pre-used working
dir and then running 'make run', 'make test' and so on doesn't fall
over with bad dependencies, missing files etc.
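
To illustrate the 'add rather than redefine' point above, here is a
purely hypothetical sketch - the names are invented for illustration,
not taken from our tree:

    # Hypothetical sketch only: Widget and its attributes are made up.
    class Widget:
        def __init__(self, data):
            self._data = data

        @property
        def items(self):
            # The existing attribute keeps its behaviour exactly as-is,
            # so code in other people's in-flight branches (and
            # untested code in ours) keeps working.
            return list(self._data)

        @property
        def items_by_name(self):
            # The new, compatible attribute lands alongside the old
            # one; callers migrate in later branches, and 'items' is
            # removed in a final cleanup branch once nothing uses it.
            return dict((item["name"], item) for item in self._data)

    w = Widget([{"name": "foo"}, {"name": "bar"}])
    print(w.items)          # old callers unaffected
    print(w.items_by_name)  # new callers opt in

Each step in that sequence is individually safe to land and
individually safe to roll back, which is the whole point.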

The general approach is very similar to what we are now doing with
schema changes to get low-latency schema deploys: make each individual
change simpler, do only the work that is safe to do now, and clean up
later.

-Rob

