launchpad-dev team mailing list archive
Message #01308
Announcing Zero-OOPS Policy
Hi everyone,
Here are the actual details of the Zero-OOPS Policy we discussed last week in
London. The initial cost of this policy will be high, but we'll get a much
smoother operation once we are past that start-up cost.
This document is also available on the wiki at
https://dev.launchpad.net/PolicyAndProcess/ZeroOOPSPolicy
= Policy Overview =
In a nutshell, this policy is about moving the tolerance level for OOPSes to
zero. This means that any user-visible error happening in production is a
stop-the-line event and should be fixed ASAP.
== Why this policy? ==
We should be proud of the service we build and deliver, and we cannot take
pride in a low-quality product. Every time an OOPS page reaches a user, whether
because of a timeout or an unhandled exception, we have failed on the measure
of quality. An OOPS page means that a user was prevented from completing their
work, and that's really bad.
Having zero tolerance for OOPSes in production means that we are putting
actions behind our mantra of quality. An OOPS is basically an escaped defect,
and we cannot tolerate that.
Every day we have between tens and hundreds of OOPSes. This policy is basically
about making sure that the Exceptions and Timeouts sections of the report are
empty.
== What should be done about OOPSes ==
* Every time an OOPS is encountered in production, a bug should be filed for
it with a priority of High. It should be tagged with either 'oops' or
'timeout'.
* Fixing bugs tagged 'oops' and 'timeout' takes priority over any
feature development.
* We should deploy all possible OOPS fixes to production.
Once we achieve Zero-OOPS status:
* Do root-cause analysis for every OOPS that occurs in production, to ensure
that our process is really robust against escaped defects.
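As a rough illustration of the filing rule above, here is a minimal Python sketch. It is not Launchpad code: `BugDraft`, `draft_bug_for_oops`, and the example OOPS id are hypothetical names, and mapping timeouts to the 'timeout' tag and everything else to 'oops' is my reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugDraft:
    """Hypothetical container for the fields the policy asks for."""
    title: str
    priority: str
    tags: tuple


def draft_bug_for_oops(oops_id, is_timeout):
    """Build the bug fields for one production OOPS.

    Assumption: timeouts get the 'timeout' tag, all other OOPSes get 'oops'.
    The priority is always High, as the policy states.
    """
    tag = "timeout" if is_timeout else "oops"
    return BugDraft(title=f"OOPS {oops_id}", priority="High", tags=(tag,))


# Usage with a made-up OOPS id:
draft = draft_bug_for_oops("OOPS-EXAMPLE1", is_timeout=True)
print(draft.priority, draft.tags)
```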
== But not all OOPSes are equal ==
All OOPSes in the "Exceptions" and "Time outs" sections should be eliminated.
If an error isn't important (because it's only triggered by robots, or for
whatever other reason), then it shouldn't be recorded as an OOPS. Change the
exception type so that it doesn't appear in these sections.
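To make "change the exception type" concrete, here is a minimal Python sketch of the idea. It does not show how Launchpad's error reporting actually works: `IgnoredError`, `RobotProbeError`, `should_record_oops`, and `handle_request` are all hypothetical names invented for the example.

```python
class IgnoredError(Exception):
    """Hypothetical base class for errors that should not produce an OOPS."""


class RobotProbeError(IgnoredError):
    """Hypothetical example: a malformed request that only robots trigger."""


def should_record_oops(exc):
    """Return True if this exception belongs in the OOPS report."""
    return not isinstance(exc, IgnoredError)


def handle_request(view, recorded):
    """Call a view; record the exception type name only when it counts."""
    try:
        return view()
    except Exception as exc:
        if should_record_oops(exc):
            recorded.append(type(exc).__name__)
        return "error page"


def robot_view():
    raise RobotProbeError("nonsense URL")


def broken_view():
    raise KeyError("missing key")


recorded = []
handle_request(robot_view, recorded)   # not recorded
handle_request(broken_view, recorded)  # recorded as a real defect
print(recorded)
```

The point of the design is that the decision lives in the exception's type, so the report sections only ever contain errors someone has to fix.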
The end goal is that users don't get the BSOD pages and that the OOPS
report sections are empty, so that when something appears there, we know it's
a problem to fix, with no sifting through many false positives.
== When ==
We are starting this policy now.
== Coming Soon ==
A burn-down chart of the bugs with the "oops" and "timeout" tags, as well as a
timeline of the number of OOPSes per day in the different production
components.
--
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx