← Back to team overview

launchpad-dev team mailing list archive

Re: revamping 'monthly releases', and reminder to qa db patches

 

On Thu, Mar 31, 2011 at 5:31 AM, Julian Edwards
<julian.edwards@xxxxxxxxxxxxx> wrote:
> On Tuesday 29 March 2011 05:11:53 Robert Collins wrote:
>> What-do-you-all-think?
>
> I think that simplification is good.
>
> Can you summarise why we need downtime for DB change deployments?  Then
> perhaps we can work a plan to minimise or eliminate those altogether.

https://dev.launchpad.net/LEP/ReliableDBDeploys +
https://dev.launchpad.net/Database/LivePatching both have reasonable
technical information, but there is indeed no simple statement of
limitations.

In short 'concurrency hard'.

Slightly less shortly: slony 1.x (which we run) takes out an exclusive
lock on all relations in the replication set when performing DDL
(create table/alter table/ drop table - ditto for indices and views
etc etc). Postgresql also takes out exclusive locks on relations to do
most DDL. One notable exception to this is 'create index concurrently'
which can add an index live without ever taking an exclusive lock.

exclusive locks on single objects can be hard to acquire (because
we're a busy system - 70 requests a second from the lpnet cluster
alone - there may be no point where no transaction has a lock e.g.
product for several minutes (because of overlapping requests having
row level locks (which can happen when related tables are updated)).

exclusive locks on multiple objects are even harder to acquire, *and*
can also dead lock (if a client has a row lock in table A and goes for
one in table B, and we've got an exclusive lock on B already - boom,
deadlock.

One option would be a different replication strategy - e.g. slony2
takes out narrower locks for executing DDL, but will still take out
multiple exclusive locks because it locks all relations directly
affected by DDL (and sometimes gets it wrong and takes too few :().
However we'd still need to deal with postgresql itself taking
exclusive locks out; ditto for londista and postgresql 9's built in
replication.

What I think we need to do is design for *very fast* downtime: quiesce
the system in a few seconds, apply the DDL (and structure it to be a
few seconds long (*)), then bring the system back to activity in a few
more seconds. This implies a few things:
 - we cannot affort to stop or upgrade during this process
 - cronscripts doing multi-minute things need to be interruptible (or
use the LP API rather than db connections)
 - some web and api requests will be disrupted, so this process will
be a little disruptive (we can't be casual about it)

(*) short DDL implies:
 - when adding a column it must be NULLable, must not have a default
 - uniqueness on very big existing tables needs to be done via a
unique index (with CONCURRENTLY)
 - we'll learn more as we go. ..

HTH
Rob



References