launchpad-dev team mailing list archive

Re: staging was down last week & weekend

 

On Thu, 2009-08-20 at 10:06 +1000, Martin Pool wrote:
> When I was trying to use it in Taipei for demos, staging seemed to be
> down quite a lot over Friday and the weekend.  Was that a known issue?

Eventually...
The thread in question is "make not working in devel"

The failure over the weekend, as best I could determine, was a follow-on
from the original one. Essentially, the staging restore process left the
app services in gaga land. "The process listing was a mess" would be
another description. :-/
There were excess mailman processes that only responded to a hard kill.
The app (etc.) server itself recovered after a straight stop/start cycle.


>  I couldn't find anything on the list in a brief scan.  What's the
> right escalation for such an outage?

I'd suggest there isn't one and shouldn't be one, at least outside core
hours. Within core hours the process is well established: we get alerts
and respond based on priority need.

Treating it as a production system would require another system that
could trial-run these updates (and break) in the current automated
fashion, i.e. staging-production vs staging-staging. Or we could bring
back the rarely updated demo system.


Following on from this particular staging issue, we have added an RT
task against ourselves to add a "die! Die! DIE!!!!" loop to the init.d
scripts for all the launchpad services. It's excessive, but appears
necessary.
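
For illustration only, the kind of escalating stop such a loop implies
might look roughly like the sketch below. This is not the actual init.d
change. The names stop_hard and pid_alive, the grace period, and the
choice of SIGTERM followed by SIGKILL are all assumptions made for the
example, and it assumes a Unix host.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this pid appears to exist."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the pid
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_hard(pid, grace=5.0, poll=0.1, send=os.kill, is_alive=pid_alive):
    """Hypothetical escalating stop: polite signal first, hard kill last.

    Returns the signal after which the process was gone, or None if it
    was not running to begin with. Raises RuntimeError if the process
    is still present after SIGKILL and the grace period.
    """
    if not is_alive(pid):
        return None  # nothing to stop
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            send(pid, sig)
        except ProcessLookupError:
            return sig  # it exited between the check and the signal
        deadline = time.monotonic() + grace
        while is_alive(pid):
            if time.monotonic() >= deadline:
                break  # still running: escalate to the next signal
            time.sleep(poll)
        else:
            return sig  # the loop ended because the process went away
    raise RuntimeError(f"pid {pid} survived SIGKILL")
```

A stop action would call something like stop_hard(pid) for each pid it
is responsible for. The send and is_alive parameters exist only so the
escalation logic can be exercised without real processes.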


Cheers!
- Steve




