launchpad-dev team mailing list archive

Re: cold cache timeouts == OOPS == Critical

To: Gary Poster <gary.poster@xxxxxxxxxxxxx>
From: Robert Collins <robert.collins@xxxxxxxxxxxxx>
Date: Wed, 3 Aug 2011 07:19:56 +1200
Cc: Launchpad Development List <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <3AED6942-A744-4A0C-82E9-3638604960DE@canonical.com>
Sender: robertc@xxxxxxxxxxxxxxxxx

On Wed, Aug 3, 2011 at 1:53 AM, Gary Poster <gary.poster@xxxxxxxxxxxxx> wrote:

> Past apps I've worked on have regarded hot cache bugs as critical, and cold cache bugs as something to cope with, one way or another.  What LP has now is a higher standard, which is nice, except that we haven't managed to meet the lower one yet.
>
> This might just be an observation that I've shared, and we all nod our heads and move on.  That's fine.  I'm also fine with considering changing our policies.  Options would include the following:
>
>  * cold cache bugs are a lower priority, or even Won't Fix.
>  * cold cache bugs are grouped together in a single critical bug which is about keeping our caches hot (I'm not sure what, if anything, can be improved here, to be clear; I'm speaking in the abstract).  That kind of change wouldn't make the problem go away, though; it would just make it less frequent.

I see a couple of factors in considering a change here.

Firstly, there are two reasons for oopses-are-critical:
 * An OOPS usually means a user being unable to use the system
 * An OOPS is something we need to investigate

So any stream of unimportant OOPSes sucks our maintenance squad's time
- we need to fix things so we don't see them, so that when we get an
important OOPS, we can leap on it and fix it: we want a good
signal-to-noise ratio.

And we want the system to work for users.

LP's database is 300GB, more or less. That's a -lot- to fit into
memory, and that's just after a complete pack-and-optimise due to our
rebuilding everything.

URLs that are rarely used are more likely to have cold cache
behaviour, and so are more likely to be slow and time out.

So, I think we need to design with cold cache in mind, at least with
our current environment. Designing with cold cache in mind has
multiple benefits:
 - it makes it *safe* for us to run with less memory than DB - so it's
cheaper to run LP as we continue to grow
 - it will help with hot cache operations because we'll be doing less
work for them as well
 - it helps when we have to reboot a db server, if the clients can
tolerate it not having the whole DB in memory right after startup.

When I joined as TA a year and a bit ago, the whole team cared about
performance, but was having trouble executing in a systematic way; we
have come a tremendous distance since, and learnt a great deal about
what makes the system perform well or poorly.

It's true that we have not yet fixed every slow page that was already
slow a year ago, but then we knew we had a lot of performance debt to
address.

I think fixing things to be tolerant of *some* cold cache situations
fits under the existing approach just fine - we will need some schema
refactorings, some of the time; other times it's just query tuning so
that we don't read cold rows.

We probably cannot remove -all- cold cache effects -all- the time. I
suggest we be guided by the numbers: if a page is in the timeout
report, then it was genuinely too slow.

HTH
-Rob

Follow ups

Re: cold cache timeouts == OOPS == Critical
From: Stuart Bishop, 2011-08-03

References

cold cache timeouts == OOPS == Critical
From: Gary Poster, 2011-08-02