
launchpad-dev team mailing list archive

Re: memcache, responsiveness and load {short story, lets turn memcache off}

 

On 2010-08-05 07:59, Martin Pool wrote:

If the main problem is "user changes something but doesn't see the change
reflected in the page," that's the same problem we have with replication
lag.  Couldn't we solve that in the same way, by having a user bypass
memcached for a while after a POST?

We could.  It would be a fairly cute way to solve the "but I thought I
just changed that?" bug, but it's not a total solution: "but Jeroen,
I thought you said you fixed that?"  On the whole I would still call
it a bandaid.

Actually I think it does solve the problem when the cached lifetime is shorter than the time it takes for me to tell you I've done it and for you to load up the page with the expectation to see the work done. This is why I'm suggesting very short expiry times.

Of course there's also replication lag, browser caching, transparent proxies, and the reverse proxy, so I am taking a bit of an I-don't-need-to-outrun-the-bear-I-just-need-to-outrun-you view. If skew from those other things isn't a problem now, I'm saying we could cheaply ensure that memcached doesn't make things worse.
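
To make the two ideas above concrete (very short expiry, plus letting a user bypass the cache for a while after a POST), here is a rough sketch. It is not Launchpad code: ShortTTLCache is an in-memory stand-in for memcached, and FragmentRenderer, note_post, ttl and bypass_window are names and values made up for illustration.

```python
import time


class ShortTTLCache:
    """In-memory stand-in for memcached with per-entry expiry (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has outlived its short lifetime; treat it as a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)


class FragmentRenderer:
    """Serve page fragments from cache, except for users who just POSTed."""

    def __init__(self, cache, ttl=1.0, bypass_window=5.0, clock=time.monotonic):
        self.cache = cache
        self.ttl = ttl                      # assumed: seconds a fragment stays cached
        self.bypass_window = bypass_window  # assumed: seconds to bypass after a POST
        self._clock = clock
        self._last_post = {}

    def note_post(self, user):
        """Record that this user just changed something."""
        self._last_post[user] = self._clock()

    def render(self, user, key, render_fn):
        last = self._last_post.get(user)
        bypass = last is not None and self._clock() - last < self.bypass_window
        if not bypass:
            cached = self.cache.get(key)
            if cached is not None:
                return cached
        # Miss or bypass: render fresh and refresh the cache for everyone else.
        value = render_fn()
        self.cache.set(key, value, self.ttl)
        return value
```

The user who made the change always gets a fresh render during the bypass window, and everyone else sees data that is at most ttl seconds old. Refreshing the cache on a bypassed render is a design choice in this sketch, not something either proposal requires.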


One thing we could do is to use feature flags to turn on or off TAL
caching, so that we can make the correctness/throughput tradeoff
dynamically when we're being slashdotted.  (Again, Flickr etc.
apparently use this technique.)

I'd want it enabled all the time--but with expiry time set just long enough to take the edge off a load spike for very specific fragments. Could be as short as a second for all I care.
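
As a back-of-the-envelope illustration of why even a one-second expiry takes the edge off a spike, here is a small simulation. It is an idealized sketch: it assumes a perfectly steady request stream against a single fragment, instantaneous renders, and no cache stampede, and simulate_spike and its parameters are invented for this example.

```python
def simulate_spike(requests_per_second, seconds, ttl_ms):
    """Count backend renders for a steady stream of requests to one fragment.

    Time is tracked in integer milliseconds. A ttl_ms of 0 models caching
    being switched off: every request triggers a render.
    """
    renders = 0
    expires_at_ms = None
    total = requests_per_second * seconds
    for i in range(total):
        now_ms = i * 1000 // requests_per_second
        if expires_at_ms is None or now_ms >= expires_at_ms:
            # Cache miss: render once and cache the result for ttl_ms.
            renders += 1
            expires_at_ms = now_ms + ttl_ms
    return renders, total


if __name__ == "__main__":
    for ttl_ms in (0, 1000):
        renders, total = simulate_spike(1000, 10, ttl_ms)
        print("ttl=%dms: %d renders for %d requests" % (ttl_ms, renders, total))
```

Under these assumptions, 1000 requests per second for 10 seconds costs 10000 renders with caching off and 10 renders with a one-second expiry. On a quiet day with one request every couple of seconds, the same setting gives a hit rate near zero, which is the "low hit rates are fine for normal days" case.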

When slashdot strikes, I would _not_ want our users to time out until the oopses show up in our email the next morning, and some enterprising engineer checks the referrer, and the problem is debated with IS, and decisions are held off until someone responsible comes online, and then caching is enabled either by cowboy patch or a multi-handoff review procedure, and finally the jolly lot of us figure out whether any glitches are due to the spike, to a systemic failure, or to pre-existing problems that we were hiding because caching was disabled.

Maybe I'm over-focusing on the Sudden Deadly Spike. I just find it a useful way to think about memcached because it removes all temptation to reduce timeout counts without fixing latency. It's also what makes me feel that low hit rates are fine for normal days--as long as they shoot up (and app/db load stays relatively steady) when Google's Doodle of the Day happens to link to Bug #1.


Jeroen


