
Re: reducing Launchpad downtimes?



On 21/02/2008, Martin Pool <mbp@xxxxxxxxxxxxx> wrote:
> Launchpad was down for a bit under three hours today.  As I recall it
>  was similar last month.  I realize most of you were asleep, but it was
>  the middle of the work day in Australia and other places.  (So I was
>  forced to go and ride my motorcycle, how sad ;-)  A few irc users
>  commented on it.
>
>  I'm told the downtime really is the downtime it takes to do the
>  database changes, so there's no easy answer.  But as we want to be a
>  really great and very reliable collaboration platform, and to still do
>  updates at frequent intervals, I think this is something to think very
>  hard about for later cycles.
>
>  I believe the heavy lifting in this upgrade was to improve
>  translations performance, which I'm sure will be pleasing to many
>  users.  But it's a bit stiff that this stops people using code
>  hosting, bugs, or PPAs.

Apparently we also switched to a faster database server machine this
cycle, which accounted for the length of the downtime.  This has the
potential to reduce the downtime of future software upgrades (although
it is still worth discussing how to decrease it further).


>  Some (possibly naive) ideas:
>
>   * split things so that you take down just the translations app while
>  its data is being migrated, leaving other apps running

There is a lot of overlap between applications (e.g. Person, Product,
Distribution, etc), so this is not as easy as it might seem.


>   * add an abstraction layer so that db changes need not be strictly
>  synchronized with code changes
>   * run in readonly mode against a copy of the database

This one has been on the todo list for a while.
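
For illustration only, here is a minimal sketch of what a read-only mode
might look like.  It is not Launchpad code: the Site class, the
ReadOnlyError exception and the read_only flag are hypothetical names, and
plain dicts stand in for the primary database and its copy.  The idea is
that reads are served from the copy while the primary is being migrated,
and writes are refused with a clear message rather than touching the
database under migration.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the site is read-only."""


class Site:
    """Hypothetical front end that can switch to a read-only copy."""

    def __init__(self, primary, replica):
        self.primary = primary    # stand-in for the live database
        self.replica = replica    # stand-in for a copy taken before the update
        self.read_only = False    # set to True for the maintenance window

    def read(self, key):
        # During maintenance, serve reads from the copy only.
        store = self.replica if self.read_only else self.primary
        return store.get(key)

    def write(self, key, value):
        # Refuse writes up front so they never reach the database being migrated.
        if self.read_only:
            raise ReadOnlyError(
                "The site is read-only during a scheduled update; "
                "please try again later."
            )
        self.primary[key] = value


primary = {"bug-1": "open"}
site = Site(primary, dict(primary))  # copy taken before the update starts
site.read_only = True
print(site.read("bug-1"))
try:
    site.write("bug-1", "fixed")
except ReadOnlyError as error:
    print(error)
```

Anything written after the copy was taken would not be visible in
read-only mode, so a real implementation would need to take the copy at
the moment the switch is made.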

>   * perhaps this is crazy but why not let people just keep trying to
>  use it, and fail any particular request that can't succeed, with a
>  clear message?

This could be actively harmful and cause the updates to fail and need
to be retried, which would increase the length of time the update
takes to complete.

>   * at least, give more warning within the app itself (as we discussed
>  recently; I really think this should be an urgent priority.)

This would be a good idea.  For something that is done on a monthly
basis at a planned time, we really should be giving more advance
notice in more prominent locations.


>  I presume there is some prior art here...
>
>  I guess you have talked about these before.  I realize it is not easy,
>  and there are lots of bugs and feature work to do.  However, for the
>  kind of promises we're making, or wanting to make, to our users,
>  Launchpad needs better than 99.5% uptime.
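
As a rough check on that figure, here is the arithmetic, assuming a 30-day
month and a single outage of three hours per month (rounding up the "bit
under three hours" above).  The function names are just for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def uptime_percent(downtime_hours, period_hours=HOURS_PER_MONTH):
    """Availability over the period, as a percentage."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_MONTH):
    """Downtime budget, in hours, that still meets the target uptime."""
    return period_hours * (1.0 - target_percent / 100.0)


# One three-hour outage per month.
print(f"uptime with 3h down: {uptime_percent(3):.2f}%")

# Downtime budgets for a few uptime targets.
for target in (99.5, 99.9):
    budget = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {budget:.2f}h per month")
```

Under those assumptions a three-hour monthly outage works out to roughly
99.58% uptime, so it only just fits inside a 99.5% target (about 3.6 hours
per month), while a 99.9% target would leave about 43 minutes.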


James.




This is the launchpad-users mailing list archive — see also the general help for Launchpad.net mailing lists.
