
launchpad-dev team mailing list archive

performance tuesday - timeout setting, to change or not, that is the question!

 

We're continuing to make steady progress on timeouts - there are 5
timeout improvement/fix patches in the deploy queue at the moment.

But our bug count is up? What's going on?

Firstly, we're still in a state of sustained overload - as I write
this, our 52 worker threads are valiantly handling 165 concurrent
sessions coming through haproxy. This doesn't affect our DB load
hugely: things queue in the appserver until one of the 4 threads
(current config) is idle, and only then get through to the backend.
What it does mean is that /most/ requests are competing for the GIL
for all their logic.
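
(Purely as an illustration - this isn't Launchpad code - here's the
queueing effect in miniature: with a 4-thread pool, extra concurrent
requests wait in-process rather than piling onto the database, while
the threads that are actually running still contend for the GIL.)

    from concurrent.futures import ThreadPoolExecutor
    import time

    def handle_request(request_id):
        # Stand-in for view logic plus DB round trips.
        time.sleep(0.1)
        return request_id

    # Only 4 requests execute at once (mirroring the 4-thread appserver
    # config); the rest sit in the executor's queue, which is roughly
    # what "queueing in the appserver" looks like from the DB's side.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(handle_request, range(20)))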

Secondly, timeouts are a *moving* target for us. Francis chose 9
seconds as a goal for the next team meetup, and our current timeout is
14000ms - 14 seconds. We have some very stubborn pages (which we are
making progress on - bug 1 now renders in 350 queries, down from 2000
six weeks ago). We also have a number of shallow timeouts that simply
haven't been looked at yet - e.g. Branch:+index with lots of bugs
(which is now fixed).
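
(For anyone not familiar with the knob in question: the hard timeout
is just a per-request deadline expressed in milliseconds. A minimal
sketch of that kind of check - the names here are hypothetical, not
Launchpad's actual API:)

    import time

    HARD_TIMEOUT_MS = 14000  # today's setting; 9000ms is the stated goal

    class RequestTimedOut(Exception):
        """Raised when a request exceeds its hard deadline."""

    def check_deadline(start_time):
        # Hypothetical helper: a request handler would call this between
        # expensive steps, e.g. before issuing another batch of queries.
        elapsed_ms = (time.time() - start_time) * 1000
        if elapsed_ms > HARD_TIMEOUT_MS:
            raise RequestTimedOut("request exceeded %d ms" % HARD_TIMEOUT_MS)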

The top 4 timeouts on the 28th were:
     132 /  272  BugTask:+index
      69 /  446  POFile:+translate
      55 /  144  Distribution:+bugs
      53 /  836  BranchSet:CollectionResource:#branches

3 of those have patches pending deployment right now (as do 3 of the
top 4 for the 1st).

Anyhow, the net result is that we don't have a dramatic trend line on
the *number* of timeouts.

What we *do* have is a system that is coping really quite gracefully
with a significant increase in load and a lower timeout: on the 28th
we had 701 timeouts out of 5,897,577 non-ops pages (the ops pages are
for nagios and haproxy and are approximately free to generate). That's
0.01% of requests being killed by a hard timeout: we're achieving 4
9's under trying conditions.
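
(For the record, the arithmetic, using nothing beyond the numbers
above:)

    timeouts = 701
    non_ops_pages = 5_897_577

    rate = timeouts / non_ops_pages      # ~0.00012, i.e. roughly 0.01%
    served_ok = 1 - rate                 # ~0.9999 -> the "4 9's"
    print("%.4f%% timed out; %.4f%% served within the timeout"
          % (rate * 100, served_ok * 100))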

I'm going to ask the LOSAs to drop the request timeout another second
now: this may cause some more pages that are on the edge to start
failing on their first hit (or second, if it's not a cold-cache
situation). If the spike is large, we can roll it back at a moment's
notice.

It's worth doing this because the resources we free up let us serve
well-behaved pages that little bit faster: we'll gain 7% headroom for
requests per day by making this change, and that's a significant step
towards ameliorating the situation we're in.
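
(Back-of-the-envelope, assuming the 7% refers to the drop in the
worst-case time any one request can hold a worker thread - 14 seconds
down to 13:)

    # Assumption: headroom here means the reduction in the worst-case
    # time a single request can occupy a worker thread.
    current_s, proposed_s = 14, 13
    headroom = (current_s - proposed_s) / current_s
    print("%.1f%%" % (headroom * 100))   # -> 7.1%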

Cheers,
Rob


