launchpad-dev team mailing list archive

Re: 5, 9, 23, 51 and other numbers

 

On April 22, 2011, John Arbash Meinel wrote:
> On 04/21/2011 10:16 PM, Francis J. Lacoste wrote:
> > (Posted on the blog at
> > http://blog.launchpad.net/general/5-9-23-51-and-other-numbers)
> > 
> > We are now two months away from our next Thunderdome. How are we doing
> > with regard to the objectives set for that milestone? You may recall
> > the objectives from my last post:
> >    * have no timeouts with a cut-off at 9s;
> >    * have an empty critical bugs queue;
> >    * have a slot free on our ‘Next’ queue.
> > 
> > We practically achieved the first objective! Today, we lowered the hard
> > timeout to 9s and this didn’t increase our number of daily timeouts. We
> > don’t have zero timeouts yet. We still have a fair bunch of timeout bugs
> > to fix. But we get on average 650 requests timing out in a day. That’s
> > less than 0.0001% of our traffic.
> 
> 650 / 8M = 0.008% (0.00008 as a fraction). But still, very well done.

Doh, well spotted! I've made the correction on the blog. 
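
For the record, the check is a one-liner. A quick sketch, taking the 650
timeouts per day from the post and John's 8M requests per day as the inputs
(the function name is just for illustration):

```python
def timeout_share(timeouts_per_day, requests_per_day):
    """Return the share of requests that time out, as a fraction."""
    return timeouts_per_day / requests_per_day


# 650 timeouts per day (from the post), 8M requests per day (John's figure).
share = timeout_share(650, 8_000_000)

# A percentage is the fraction multiplied by 100.
print(f"fraction: {share:.8f}")
print(f"percent:  {share * 100:.4f}%")
```

That gives roughly 0.008%, i.e. about 1 request in 12,000.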

> 
> > These remaining timeout bugs are part of our second objective. On that
> > front, we are in a more difficult position. We have 259 critical bugs to
> > close. That went up since last time! What went wrong? First, we had
> > fewer people working on critical bugs for once. That’s been fixed this week
> > when the Orange squad rotated back on maintenance. We again have two
> > full squads working on critical bugs. Second, we modified our OOPS
> > reporting to show all timeouts happening, not only the ones occurring
> > the most often. That resulted in about 30 new timeouts filed. (See the
> > high red bar at the start of the graph). Fortunately for us, the rate
> > of new critical bugs is declining. We are at about 23 on average in the
> > last two weeks. That’s still high and some of those are related to JS
> > regressions escaping to production because our Windmill test
> > infrastructure is disabled. This means that 51 is now the magic number.
> > We need to close 51 of these critical bugs per week to reach 0 by the
> > Thunderdome. That was the number we closed in our best week, just before
> > the number of people working on criticals was reduced. So we’ll also
> > need to reduce the number of new critical bugs found each week to
> > succeed here.
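
To sanity-check the 51, here is a rough sketch. It assumes a constant 23 new
critical bugs per week and a steady closure rate, which are simplifications,
and the function names are only illustrative. The post does not give the
exact number of weeks left, so the sketch derives it from the three figures
rather than assuming one:

```python
def implied_weeks(open_bugs, new_per_week, closed_per_week):
    """Weeks needed to empty the queue at a steady net closure rate."""
    net_per_week = closed_per_week - new_per_week
    if net_per_week <= 0:
        raise ValueError("queue never empties unless closures exceed arrivals")
    return open_bugs / net_per_week


def required_weekly_closures(open_bugs, new_per_week, weeks_left):
    """Closures per week needed to reach zero open bugs in weeks_left weeks."""
    return open_bugs / weeks_left + new_per_week


# Figures from the post: 259 open, about 23 new per week, 51 closed per week.
weeks = implied_weeks(259, 23, 51)
print(f"weeks to zero at 51 closed/week: {weeks}")
print(f"closures needed over that span:  {required_weekly_closures(259, 23, weeks)}")
```

Closing 51 per week against 23 incoming nets 28 per week, which clears 259
bugs in a bit over nine weeks. That is consistent with the Thunderdome being
about two months away.
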
> 
> I realize 259 is a lot, enough that it is hard to get a handle on. Have
> you gone through them at all to see if there are bulk-fixes, things that
> are already fixed, etc.? I'm certainly guessing there is a fair bulk that
> are going to be similar in effort, and a really long tail of ones that
> are hard to handle (like problematic timeout pages, etc.) It would be
> interesting to get a feeling for where the knee of the curve is, though.
> 

Robert has recently retriaged those, so he would be in a better position to
comment on the redundancy we have in the queue. The only one I know of is that
a good number of the remaining timeouts seem to happen in Python land and could
be related to the GIL problem. So hopefully they will be fixed by our move to a
single-threaded deployment.

Cheers

-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx

