launchpad-dev team mailing list archive

timeouts, fire fighting and 'we are finished when...'

 

We want low timeouts to prevent cascading server lockups: it only
takes a few long concurrent CPU-bound queries before other users
experience huge delays in their work. Low timeouts also give users a
faster signal when they do something that's unusually expensive. A
snappy system is easier to keep snappy: less lock contention, less IO
contention etc.

Most, if not all, of our timeouts happen because we *have a
successful product*. Starting with the simplest thing and iterating is
a fantastic principle, but the corollary is that we have to iterate:
the hard timeout limit is our backstop for when something we're
iterating on is made slower by users. So every time we make the system
better we can simply expect more users and more pressure on the DB,
more mails to send, more bugs to index, etc.

Thus, it seems to me that we should only consider a timeout problem
fixed when we *have enough headroom* to tolerate growth for a
reasonable time. Concretely, something that is timing out on 5% of
calls and routinely taking (say) 15 seconds on the server today -
that's something we should bring down to completing in (say) 3 seconds
on the same dataset that takes 15 seconds today, before switching
context. Otherwise we'll be coming back to it almost immediately, and
progress will feel slow. In bzr, once we got status really seriously
under control, we stopped getting hammered with performance issues in
that part of the code base, stopped reanalysing the same problem, and
were able to forget about it for a couple of years.
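
To put rough numbers on "enough headroom", here is a minimal sketch.
It assumes request time grows by a fixed percentage each month as the
data grows, which is a simplification, and the 20 second timeout and
10% monthly growth are made-up illustrative values, not measurements
from our servers. The function name is hypothetical too.

```python
import math


def months_of_headroom(current_seconds, timeout_seconds, monthly_growth):
    """Estimate months until a request reaches the hard timeout.

    Assumes (hypothetically) that the request's duration compounds by
    `monthly_growth` each month, e.g. 0.10 for 10% growth.
    """
    if current_seconds >= timeout_seconds:
        # Already at or over the limit: no headroom left.
        return 0.0
    if monthly_growth <= 0:
        # No growth means the request never reaches the timeout.
        return math.inf
    # Solve current * (1 + g) ** months == timeout for months.
    return math.log(timeout_seconds / current_seconds) / math.log(1 + monthly_growth)


# Illustrative values only: a 20 second hard timeout and 10% monthly growth.
before_fix = months_of_headroom(15.0, 20.0, 0.10)  # the 15 second page today
after_fix = months_of_headroom(3.0, 20.0, 0.10)    # the same page brought down to 3 seconds
print(f"before: {before_fix:.1f} months, after: {after_fix:.1f} months")
```

Under those assumptions the 15 second page is back at the timeout in
about three months, while the 3 second version buys well over a year.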

There are some infrastructural issues that will make this hard: I'm
going to unblock anyone who wants to work on such things (like rabbit)
in any way I can; as time permits I'll be working directly on those
enablers too.

Cheers,
Rob