
launchpad-dev team mailing list archive

Re: memcache, responsiveness and load {short story, lets turn memcache off}

 

On 2010-08-05 07:59, Martin Pool wrote:

If the main problem is "user changes something but doesn't see the change
reflected in the page," that's the same problem we have with replication
lag.  Couldn't we solve that in the same way, by having a user bypass
memcached for a while after a POST?

We could.  It would be a fairly cute way to solve the "but I thought I
just changed that?" bug, but it's not a total solution: "but Jeroen,
I thought you said you fixed that?"  On the whole I would still call
it a bandaid.
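
For illustration only, the bypass-after-POST idea might look roughly like the sketch below.  Everything in it is an assumption made for the example: the class and method names, the in-process dict standing in for memcached, and the 10-second window (nobody in this thread gave a number).

```python
import time

BYPASS_SECONDS = 10.0  # hypothetical window, not a figure from this thread


class PostAwareCache:
    """Illustrative wrapper: skip cached reads for a user who just POSTed."""

    def __init__(self, clock=time.monotonic):
        self._store = {}      # stands in for memcached in this sketch
        self._last_post = {}  # user id -> time of that user's latest POST
        self._clock = clock

    def note_post(self, user):
        """Record that this user has just changed something."""
        self._last_post[user] = self._clock()

    def get(self, user, key, render):
        last = self._last_post.get(user)
        if last is not None and self._clock() - last < BYPASS_SECONDS:
            # Recent POST: bypass the cache so the user sees their own change.
            return render()
        if key not in self._store:
            self._store[key] = render()
        return self._store[key]
```

Note that only the user who POSTed gets the fresh render; anyone else still reads the cached copy, which is the "but Jeroen, I thought you said you fixed that?" case.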

Actually I think it does solve the problem when the cached lifetime is
shorter than the time it takes for me to tell you I've done it and for
you to load up the page expecting to see the work done.  This is why
I'm suggesting very short expiry times.

Of course there's also replication lag, browser caching, transparent
proxies, and the reverse proxy, so I am taking a bit of an
I-don't-need-to-outrun-the-bear-I-just-need-to-outrun-you view.  If skew
from those other things isn't a problem now, I'm saying we could
cheaply ensure that memcached doesn't make things worse.

One thing we could do is to use feature flags to turn TAL caching on
or off, so that we can make the correctness/throughput tradeoff
dynamically when we're being slashdotted.  (Again, Flickr etc.
apparently use this technique.)

I'd want it enabled all the time--but with expiry time set just long
enough to take the edge off a load spike for very specific fragments.
Could be as short as a second for all I care.

When Slashdot strikes, I would _not_ want our users to time out until
the oopses show up in our email the next morning, and some enterprising
engineer checks the referrer, and the problem is debated with IS, and
decisions are held off until someone responsible comes online, and then
caching is enabled either by cowboy patch or a multi-handoff review
procedure, and finally the jolly lot of us figure out whether any
glitches are due to the spike, to a systemic failure, or to pre-existing
problems that we were hiding because caching was disabled.

Maybe I'm over-focusing on the Sudden Deadly Spike.  I just find it a
useful way to think about memcached because it removes all temptation to
reduce timeout counts without fixing latency.  It's also what makes me
feel that low hit rates are fine for normal days--as long as they shoot
up (and app/db load stays relatively steady) when Google's Doodle of the
Day happens to link to Bug #1.
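
To make the two options concrete, here is a rough sketch of a fragment cache that is gated by a flag and uses a very short expiry.  The flag names, the dict used as a flag store, and the FragmentCache class are all invented for this example; this is not Launchpad's actual feature-flag or TAL caching code.

```python
import time

# Hypothetical flag store; the real feature-flag API is not shown in this thread.
FLAGS = {"tal_cache.enabled": True, "tal_cache.ttl_seconds": 1.0}


class FragmentCache:
    """Illustrative fragment cache: flag-gated, with a short expiry time."""

    def __init__(self, flags=FLAGS, clock=time.monotonic):
        self._flags = flags
        self._clock = clock
        self._entries = {}  # key -> (expires_at, html)

    def render(self, key, render_fragment):
        if not self._flags["tal_cache.enabled"]:
            # Flag off: always render, never touch the cache.
            return render_fragment()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        html = render_fragment()
        ttl = self._flags["tal_cache.ttl_seconds"]
        self._entries[key] = (now + ttl, html)
        return html
```

With the flag always on and the expiry at one second, a fragment is never more than about a second stale, and no one has to flip anything when a spike arrives.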
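
A back-of-the-envelope simulation of that last point, under loudly stated assumptions: requests for a single fragment arrive evenly spaced, and the traffic figures in the note below are made up for illustration rather than measured.

```python
def hit_rate(requests_per_second, ttl_seconds, duration_seconds=60):
    """Simulate evenly spaced requests for one cached fragment.

    Returns (hit_rate, renders), where renders counts cache misses.
    """
    total = int(requests_per_second * duration_seconds)
    expires_at = float("-inf")  # nothing cached yet
    renders = 0
    for i in range(total):
        now = i / requests_per_second
        if now >= expires_at:
            # Miss: render the fragment and cache it for ttl_seconds.
            renders += 1
            expires_at = now + ttl_seconds
    return (total - renders) / total, renders
```

With a one-second expiry, hit_rate(0.5, 1) gives a hit rate of zero (one request every two seconds never finds a live entry), while hit_rate(500, 1) gives 60 renders in a minute, about the same rendering load as a quiet day, at a hit rate of 0.998.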

Jeroen


