
launchpad-dev team mailing list archive

memcache, responsiveness and load {short story, let's turn memcache off}

 

So, we've got a partial deployment of memcache in place, and while it
may be contentious, I think it has distracted us from solving root
cause problems.

Executive summary
=================

memcache is getting a 24% hit rate
(https://lpstats.canonical.com/graphs/MemcachedTotal/ - sorry that it's
private). We're pushing a bunch of stuff into it and only reading back
a small fraction (42.87 misses to 13.71 sets per $unit-of-time).
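As a rough check on those figures, here is a minimal sketch that recomputes
the hit rate. It assumes the 13.71 figure counts lookups that were served
from the cache (hits); the text above labels it "sets", so treat that reading
as an assumption. The function name is illustrative only.

```python
def hit_rate(hits: float, misses: float) -> float:
    """Fraction of cache lookups that were served from the cache."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


# Figures quoted above, per unit of time.
# Treating 13.71 as hits is an assumption, not something the graph confirms.
rate = hit_rate(hits=13.71, misses=42.87)
print(f"hit rate: {rate:.1%}")
```

Under that reading the two numbers are consistent with the quoted 24%.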

memcache doesn't help with our miss-case responsiveness at all: it
can only amortize expensive queries, and help us endure load storms.
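To illustrate the amortization point, here is a small sketch of how the mean
request time moves with hit rate while the miss-case time does not. The two
cost figures are made-up placeholders, not measurements from Launchpad, and
the model ignores the small extra cost of the cache lookup and set on a miss.

```python
def mean_latency(hit_rate: float, hit_cost: float, miss_cost: float) -> float:
    """Average seconds per request when a fraction hit_rate is served from cache."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


HIT_COST = 0.005   # hypothetical: seconds to serve a cached fragment
MISS_COST = 12.0   # hypothetical: seconds to run the expensive query

for rate in (0.0, 0.24, 0.90):
    mean = mean_latency(rate, HIT_COST, MISS_COST)
    # The request that misses pays MISS_COST regardless of the hit rate.
    print(f"hit rate {rate:.0%}: mean {mean:.2f}s, miss case {MISS_COST:.2f}s")
```

Whatever the hit rate, the user who lands on a miss still waits for the full
query, so only making the query itself cheaper improves that case.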

Focusing on the root causes will give us a better user experience more
quickly, and without the side effects we're having from memcached.

We can add memcached back into the mix (it's a great tool) once we're
starting to drive user volume up (rather than our current problem of
backend-data scaling/arrangement).

Details
=======

Our memcached implementation is somewhat fragile in the way it's glued
into the system. It has only a time-based expiry policy, nothing
event-driven (and event-driven expiry will be tricky today because we
don't have an event system ready to use that will pervasively detect
changes that will impact queries and map those back to objects to
expire). A time-based expiry policy is only sufficient for pages where
users feel OK seeing stale results. For instance, the Ubuntu overall
bug count portlet is a fine one to cache: it's expensive to calculate
(it calls count(*) on a large dataset) and no one can really tell
whether it's fresh or not. On the other hand, the bug count portlet for
a new project is terrible to cache: it's cheap to calculate (small data
set) and users can tell immediately that it's out of date.
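The new-project case can be sketched with a toy in-process cache. This is
not how Launchpad's memcached integration is written; the class, the key
name and the fake clock are all hypothetical, and a dict stands in for
memcached so the example runs on its own.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Toy cache with time-based expiry plus an explicit invalidate hook."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]              # still inside the TTL: may be stale
        value = compute()                # miss or expired: pay the full cost
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """What an event-driven policy would call from every write path."""
        self._entries.pop(key, None)


# A new project with one bug, and a fake clock we can advance by hand.
bugs = ["bug 1"]
fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
KEY = "new-project-bug-count"            # hypothetical cache key


def count_bugs() -> int:
    return len(bugs)


first = cache.get_or_compute(KEY, count_bugs)   # computed and cached
bugs.append("bug 2")                            # user files a second bug
stale = cache.get_or_compute(KEY, count_bugs)   # TTL not elapsed: old count
fake_now[0] = 61.0
fresh = cache.get_or_compute(KEY, count_bugs)   # TTL elapsed: recomputed
print(first, stale, fresh)
```

The user who just filed the second bug sees the old count until the TTL
runs out. Avoiding that needs something like invalidate() wired into every
code path that can change the count, which is the event system we don't
have.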

So, this will be hard to fix (deep dependency chain, some very clever
code needed), and if not fixed will continue to provoke a series of
new bugs about where we cache things that shouldn't be *visibly*
cached.

And it is of very little benefit to us today: it doesn't (and won't
ever) help with cache misses. And on top of that we only see a 24%
cache hit rate, which is OK if we're using it to weather read-only
storms, but that isn't our deployment model today - our model is to
use it to avoid work which is inherently expensive, which is not a
good use of memcached. We need to instead make that work cheap
(somehow).

So, I propose to turn memcached off on Friday, unless I'm lynched
before then, which I kind of expect :)

Doing this will immediately close a half-dozen bugs, and focus our
timeout and performance efforts closer to the actual source of our
problems.

In the future, I would like to be able to turn memcached back on in
more of a scaling role, which is what it is designed for and great at.

-Rob


