launchpad-dev team mailing list archive

Re: importance inflation (was: merge-proposal-jobs interruption incident)

To: Aaron Bentley <aaron@xxxxxxxxxxxxx>
From: Robert Collins <robertc@xxxxxxxxxxxxxxxxx>
Date: Fri, 5 Aug 2011 08:13:45 +1200
Cc: Launchpad Development <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <4E3AA9A2.8010006@canonical.com>

On Fri, Aug 5, 2011 at 2:16 AM, Aaron Bentley <aaron@xxxxxxxxxxxxx> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11-08-04 09:24 AM, Gary Poster wrote:
>> I spoke with Aaron on IRC.  He identified 820510 and 820511 as the biggest bang for the buck.  He did not want to mark these as critical.  I don't really agree, but I also don't really care a lot.  Yellow hopes to tackle at least one of those RSN.
>
> I think these bugs are important to fix, but I don't want to mark them
> critical in order to justify fixing them.  Marking bugs critical when
> they really aren't makes us take longer to fix bugs that actually are
> critical, such as 556245, "librarian1 non responsive and gobbling
> excessive memory".

I agree.

> I think we're experiencing importance inflation, where things are
> getting marked higher than they should be, in order to increase their
> priority.  Instead of using "importance" to determine priority, I think
> priority should be an ordered list that we control directly.  That gives
> us much finer control, makes escalation trivial, and removes the
> temptation to inflate importance.

I think this would be a very interesting experiment to do - have a
functional experiment in the bug tracker where we can partially sort.

> When I'm doing triage, I have the incentive to mark a bug "critical" if
> I care about it, because our focus on critical bugs means that high bugs
> are rarely fixed.

OTOH, Francis and I swing by and downgrade things from critical
reasonably often ;)

> In scheduler theory, this is called "starvation".  However, this is also
> a solved problem: instead of trying to process all items in the highest
> priority class first, use ratios of classes.  That way, progress is made
> on all classes, but priorities are respected.  If we applied this to
> Launchpad, we could say, "For every 5 critical bugs you close, close a
> high bug.  For every 5 high bugs you close, close a low bug."

We could (that approach is called weighted fair queueing IIRC), but I
don't think it applies to the critical vs high bug sets: it applies
when you have N distinct queues which all need to be processed.
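
To make the ratio rule quoted above concrete, here is a minimal sketch.
It is a simple credit counter rather than full weighted fair queueing,
and the names (ratio_order, the bug labels) are made up for
illustration; nothing here reflects how Launchpad actually schedules
work. It assumes the queues are fixed up front, with no new arrivals.

```python
from collections import deque


def ratio_order(queues, ratio=5):
    """Return the order in which items get closed.

    `queues` is a list of deques, highest-priority class first. After
    `ratio` closes in one class, the next non-empty lower class gets
    one turn, so no class is starved.
    """
    # credits[i] counts closes in class i since it last yielded a turn.
    credits = [0] * len(queues)
    order = []
    while True:
        nonempty = [i for i, q in enumerate(queues) if q]
        if not nonempty:
            return order
        # Walk down from the top class, skipping any class that has
        # used up its run, but never past the last non-empty class.
        pos = 0
        while pos < len(nonempty) - 1 and credits[nonempty[pos]] >= ratio:
            pos += 1
        level = nonempty[pos]
        order.append(queues[level].popleft())
        credits[level] += 1
        if pos > 0:
            # The class directly above has now paid its debt.
            credits[nonempty[pos - 1]] = 0


# Hypothetical backlog, with ratio=2 to keep the example short.
example = [
    deque(["c1", "c2", "c3", "c4", "c5", "c6"]),  # critical
    deque(["h1", "h2", "h3"]),                    # high
    deque(["l1", "l2"]),                          # low
]
print(ratio_order(example, ratio=2))
# ['c1', 'c2', 'h1', 'c3', 'c4', 'h2', 'c5', 'c6', 'l1', 'h3', 'l2']
```

With ratio=5 this gives the "5 criticals, then a high" pattern from the
quoted proposal.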

We have that situation with feature work and maintenance work: before
the reorg, feature work starved out maintenance work; now we have
weighted things so that feature work isn't starved out by maintenance
work.

I think the bugs you filed about increased reporting in the job runner
will matter a great deal if the failure was not a one-off. We will
have to spend staff time dealing with an operational incident if it dies
again, and given that the cause is not yet known, it's likely that we
will have to. This is very much the same situation as an OOPS: until
we diagnose the cause of a given oops we cannot say whether:
 - it will happen again (and so we should fix it so that it doesn't
mask other oopses)
 - it was a one off (and so we can just close the bug)

IMO when you have two bugs, a bug (A) that is directly critical per
our triage rules, and a 'high' bug (B) saying that it's very hard to
identify what's going on in bug A, then bug B has to have its
importance inflated to critical, otherwise bug A will suffer from
priority inversion: imagine that A is the last critical bug. It can't
be worked on because the data to solve it will come from closing bug
B. And B is not critical, so it's going to be worked on somewhere in
the 6-month window of bugs that we try to size the 'high' set to. This
then gives A an effective priority of 'high'.

This is a classic priority inversion, and the normal scheduling fix
(priority inheritance) is to grant the higher priority to the task
holding the resource needed by the higher-priority task.
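
As a rough sketch of that fix applied to bugs: a bug's effective
importance is the most urgent importance among itself and everything
it blocks, directly or indirectly. The names below (IMPORTANCE_RANK,
effective_importance, the `blocks` mapping) are hypothetical and are
not Launchpad's data model.

```python
# Smaller rank means more urgent. The three levels are illustrative.
IMPORTANCE_RANK = {"critical": 0, "high": 1, "low": 2}


def effective_importance(bug, importance, blocks):
    """Return the most urgent importance among `bug` and every bug
    it blocks, following the blocking relation transitively.

    `importance` maps bug id -> importance name.
    `blocks` maps bug id -> list of bug ids that cannot progress
    until this one is closed.
    """
    best = importance[bug]
    seen = {bug}
    stack = [bug]
    while stack:
        current = stack.pop()
        for blocked in blocks.get(current, ()):
            if blocked in seen:
                continue  # guards against cycles in the relation
            seen.add(blocked)
            stack.append(blocked)
            if IMPORTANCE_RANK[importance[blocked]] < IMPORTANCE_RANK[best]:
                best = importance[blocked]
    return best


# Bug B (high) holds the data needed to fix bug A (critical).
importance = {"A": "critical", "B": "high"}
blocks = {"B": ["A"]}
print(effective_importance("B", importance, blocks))  # critical
```

The stored importance of B stays 'high'; only the value used for
ordering work changes, which is one way to avoid inflating the
recorded importance.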

I will do another pass and challenge bugs that shouldn't be critical.
But, I think the underlying issue here isn't inflation of
non-criticals to criticals.

If we have 10 people working just on criticals (oopses, timeouts,
production problems, escalated requests) and we're going backwards,
then it indicates one of:
 - we need more resources
 - we need less time per fix
 - we need a lower rate of incoming criticals
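
The three options above are the three terms in a simple backlog
calculation. A back-of-the-envelope sketch follows; the function name
and every number in the example are hypothetical, not measurements.

```python
def weekly_backlog_change(incoming_per_week, engineers, days_per_fix,
                          working_days_per_week=5):
    """Net change in open criticals per week.

    Positive means the backlog grows ("going backwards").
    """
    fixes_per_week = engineers * working_days_per_week / days_per_fix
    return incoming_per_week - fixes_per_week


# Hypothetical: 10 people, 4 working days per fix, 15 new criticals a week.
print(weekly_backlog_change(15, 10, 4))  # 2.5
```

Adding people, cutting days_per_fix, or cutting incoming_per_week each
push the result towards zero or below.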

I think our time-per-fix is still very high, and things like the test
suite time directly impact this.

And recently we've had a big uptick in incoming criticals; that
worries me, but I've not yet done an analysis of why.

-Rob

Follow ups

Re: importance inflation
From: Aaron Bentley, 2011-08-08

References

merge-proposal-jobs interruption incident
From: Aaron Bentley, 2011-08-03
Re: merge-proposal-jobs interruption incident
From: Gary Poster, 2011-08-03
Re: merge-proposal-jobs interruption incident
From: William Grant, 2011-08-03
Re: merge-proposal-jobs interruption incident
From: Gary Poster, 2011-08-04
Re: importance inflation (was: merge-proposal-jobs interruption incident)
From: Aaron Bentley, 2011-08-04