launchpad-dev team mailing list archive

Re: RFC: Is readonly mode fixable, or should we ditch it entirely?

To: Robert Collins <robertc@xxxxxxxxxxxxxxxxx>
From: Martin Pool <mbp@xxxxxxxxxxxxx>
Date: Mon, 13 Jun 2011 19:07:28 -0700
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <BANLkTi=a4Hqg+KoZXJ_7-nLFKYD6fa1-bw@mail.gmail.com>
Sender: martinpool@xxxxxxxxx
That sounds great.

Readonly mode has some substantial drawbacks, so if anything can be
gained by getting rid of it, I say go ahead:

 * actually doing lp-related work generally requires write access
 * users can be led up the garden path (and not in a good way) if they
start doing work without noticing they can't commit it to lp
 * code hosting, the API, and perhaps other service points don't
understand readonly mode, and fail in a way that is no better than
being down -- it would be better to give clean errors when it actually
is down.  (For the API, the fault might lie with lplib, not lp itself.)
 * there are Google and archive.org caches of pages if you really need
to see them
 * heavy users are likely to have their key data offline in bug/review
mail already
 * there are a bunch of bugs about readonly mode, and more turn up
from time to time; dealing with them is a waste

Martin




On 13 June 2011 18:50, Robert Collins <robertc@xxxxxxxxxxxxxxxxx> wrote:
> Of the many performance problems we're working on, the downtime deploy
> process is particularly important: while downtime deploys are slow, we
> cannot do them frequently. While we cannot do them frequently, folk
> have to work on db-devel, leading to repeated merge conflicts,
> difficulty in delivering completed work, and extreme time lapses
> between completing something and being able to consider it done.
>
> The latency exacerbates our downtime window - folk don't want to run
> background migrations if they can avoid it, because they have to wait
> for both the schema change *and* the migration to happen before they
> can land the follow-on code. This makes downtime deploys riskier and
> more likely to fail, which compounds the issue.
>
> When we are applying schema changes we use a thing we call 'readonly'
> mode to provide some services while the schema change is being
> applied.
>
> readonly mode is expensive though. Currently the first 15-20 minutes
> of user-visible downtime are spent entirely on switching across
> to readonly mode. After the deploy all looptuner scripts are blocked
> for 24 hours while we rebuild the readonly slave, and the restoration
> of services is time consuming while we bring enough capacity online to
> handle our user load. We have to bring up new appservers because we're
> always deploying changes which are incompatible with the python schema
> definitions we have.
>
> Now, there are some things we can do to make getting into readonly
> mode faster: we can use stub's ini file to signal readonly mode as
> soon as an appserver reads the file - say once every 2 seconds or some
> such. If we let existing requests complete we'd be looking at ~22
> seconds to get into readonly mode. This won't address the need to
> bring up new appservers, nor the overhead of rebuilding the readonly
> replica.
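
A minimal sketch of the flag-file polling described above. The section and key names (`launchpad`, `read_only`) are hypothetical, since the layout of stub's ini file isn't given in this thread, and the ~22 second figure is modelled as one 2 second poll interval plus an assumed 20 second allowance for in-flight requests to complete.

```python
import configparser

POLL_INTERVAL_SECONDS = 2    # "once every 2 seconds or some such"
REQUEST_DRAIN_SECONDS = 20   # assumption: time allowed for in-flight requests


def is_read_only(ini_path: str) -> bool:
    """Return True if the (hypothetical) flag file asks for readonly mode."""
    parser = configparser.ConfigParser()
    parser.read(ini_path)  # a missing file is ignored, leaving the parser empty
    return parser.getboolean("launchpad", "read_only", fallback=False)


def worst_case_switch_seconds() -> int:
    """Upper bound for one appserver to notice the flag and drain requests."""
    return POLL_INTERVAL_SECONDS + REQUEST_DRAIN_SECONDS
```

Each appserver would call something like `is_read_only()` on its own timer; treating a missing file or key as "not readonly" is a design choice made here, not something the thread specifies.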
>
> Those overheads shouldn't be underestimated: even if we get the entry
> into readonly mode down to 30 seconds, the readonly replica breakout
> is still serialised, and if we were to do a 60 second downtime once a
> week then a full 1/7th of the time we would be busy rebuilding
> replicas. We can't do a re-spin of a failed schema change in less than
> 24 hours after a previous one. The overhead of switching appserver
> instances is also non-trivial: it takes nearly an hour to do a
> nodowntime deploy, and similarly in readonly mode as we tear down the
> old readonly instances and bring up the new ones. Total downtime we're
> looking at is schema patch time + ~4-5 minutes.
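
The 1/7th figure follows from the 24 hour replica rebuild set against a 7 day week. A small worked check, with function and constant names that are illustrative only:

```python
from fractions import Fraction

REBUILD_HOURS = 24        # scripts are blocked this long after each deploy
HOURS_PER_WEEK = 7 * 24


def rebuild_fraction(deploys_per_week: int = 1) -> Fraction:
    """Share of the week spent rebuilding the readonly replica."""
    return Fraction(REBUILD_HOURS * deploys_per_week, HOURS_PER_WEEK)


print(rebuild_fraction())  # 1/7
```

At two downtime deploys a week the same calculation gives 2/7, which is why the rebuild, rather than the downtime itself, limits how often readonly-mode deploys can happen.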
>
> There is another way to tackle the problem though: we can just switch
> off access to the database cluster, apply schema changes, and reenable
> access. Most schema changes are things we can do pretty easily without
> breaking compatibility with the python code[1]. The ones that are
> harder we can probably still do with a little thought.
>
> In this model, a downtime deploy would be as follows:
>  - ~30 minutes before: shut down scripts that can't be interrupted
>  - @T=0 disable access to the database
>  - apply the schema change
>  - reenable access to the db
>
> No appserver bouncing, nothing. We could show an error page on the
> appservers during this time - we can iterate to make that pretty, and
> we could use the aforementioned ini file to tell the appservers that a
> schema change is going on. Total downtime: schema-patch time + ~60
> seconds.
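
The proposed sequence can be sketched as below. This is only an outline: the step functions are stand-ins that record what would happen, and it assumes the schema patch runs in a transaction that rolls back on failure, so that reenabling access afterwards is safe either way.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def database_access_disabled(log: List[str]) -> Iterator[None]:
    """Block database access for the duration of the with-block."""
    log.append("disable access to the database")  # @T=0
    try:
        yield
    finally:
        # Runs even if the schema change raises (see assumption above).
        log.append("reenable access to the db")


def downtime_deploy(apply_schema_change: Callable[[], None],
                    log: List[str]) -> None:
    """Run the no-readonly downtime deploy steps in order."""
    log.append("shut down scripts that can't be interrupted")  # ~30 min before
    with database_access_disabled(log):
        apply_schema_change()
        log.append("schema change applied")


steps: List[str] = []
downtime_deploy(lambda: None, steps)
```

Nothing in this outline touches the appservers, which is the point of the proposal: the only user-visible window is the time spent inside the `with` block.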
>
> In terms of development process we would land schema changes that are
> compatible with the python code on devel - just the schema change, no
> python code changes at all. Then we'd do nodowntime deploys as normal
> up to and past that revision; when a good time to do the downtime
> arrives (e.g. we might set a fixed time of the day or week) we'd do
> the downtime deploy described above. After that's live, developers would
> then land code that uses the new schema / populates new columns etc.
>
> If we did have something that required new appservers to be deployed, we'd:
>  - do a nodowntime deploy up to the revision with the schema change in it
>  - decide when we would be doing the downtime and prevent all deploys
> until that time
>  - then do a hybrid deploy:
>   - ~30 minutes before: shut down scripts that can't be interrupted
>   - stage the new code
>   - @T=0 disable access to the database
>   - in parallel (massively parallel - do all appservers and the
> schema all at once)
>     - apply the schema change
>     - kill appservers with -9 and restart them with the new code base
>   - reenable access to the db
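
A rough sketch of the "massively parallel" step in the hybrid deploy, using a thread pool so the schema change and every appserver restart start together. The task bodies are placeholders that return a string; the real commands are not described in this thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_parallel(tasks: List[Callable[[], str]]) -> List[str]:
    """Start every task at once; return results in submission order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def hybrid_deploy(appservers: List[str]) -> List[str]:
    """Apply the schema change and restart all appservers concurrently."""
    tasks: List[Callable[[], str]] = [lambda: "schema change applied"]
    for name in appservers:
        # Bind name now so each task restarts its own appserver.
        tasks.append(lambda name=name: f"{name} restarted on new code")
    return run_in_parallel(tasks)
```

With one worker per task, the elapsed time is roughly that of the slowest task rather than the sum of all of them, which is what keeps the hybrid deploy close to schema time plus the appserver restart time.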
>
> But I think we'd want to aggressively avoid such scenarios as being
> harder and more complex to execute on, as well as having more
> downtime. (Total downtime will be schema time + 4-5 minutes as per
> readonly mode). The ability to do the schema change early means we can
> stop bundling the new python code with the schema change without
> making folk have to carry the stuff in an unmerged branch for extended
> periods of time (which readonly mode requires).
>
> Have I missed some way we can mitigate the costs of readonly mode, or
> something we'd have to have present if we ditched it?
>
> If I haven't, I propose that we:
>  - consult with our users and see if they have concerns or ideas we
> haven't considered
>  - get stakeholder buy-in (I'm going to forward this to the
> stakeholders now to start discussion)
>  - identify the core facilities we need to move to this process
>  - stop bundling schema & python changes immediately.
>
> We could, if we want, start landing schema things live on devel next week.
>
> -Rob
>
> [1]: We don't split the schema change and python definitions today
> because there is little benefit to doing so. But if doing so gets us
> shorter cycle times, then we will get benefits.
>
Follow ups

Re: RFC: Is readonly mode fixable, or should we ditch it entirely?
From: Stuart Bishop, 2011-06-14
References

RFC: Is readonly mode fixable, or should we ditch it entirely?
From: Robert Collins, 2011-06-14