launchpad-dev team mailing list archive

Oops processing/displaying/generating/reporting overhaul

 

I've been chatting with my counterparts in other web teams at
Canonical for a while now about how we all gather failure information
on our services. What I've found is that our solution to this problem
is pretty fragmented: different reimplementations of OOPSes, and forks
of the code base. Both Launchpad and other teams have scaling and
latency issues surrounding OOPSes.

So, in preparation for fixing *our* issues around OOPSes, I've now
written up a LEP describing what I see as our requirements and
constraints, as well as some of the requirements and constraints from
Ubuntu One. I expect to get similar things from other web teams over
the next week or so. (If I don't, I'll go nag :P).

I'd really appreciate feedback and critique of the LEP - are there
hidden assumptions I should call out that will influence our results?
Have I missed a crucial problem? Is there a canned solution we can
just grab and use?

The LEP: https://dev.launchpad.net/LEP/OopsDisplay

This will be a new codebase for several reasons:
 * the only thing in common with the existing code base will be the
SQL statement normalising code
 * the project, like other components of Launchpad, will be AGPL3
 * transitioning the existing code base would run into the very
friction that makes it hard to improve at the moment (we've had
several engineers founder trying to make nontrivial changes to it).

At this point, I think that we should do concentrated work on this
sometime after doing the merge machinery and parallel testing work,
but it may be that some interested folk want to do patches during idle
cycles. I'll see about bootstrapping a minimal environment once all
the constraints are in and the LEP has been reviewed by jml.

-Rob


