performance tuesday - timeout setting, to change or not, that is the question!
We're continuing to make steady progress on timeouts - there are 5
timeout improvement and fix patches in the deploy queue at the moment.
But our bug count is up? What's going on?
Firstly, we're still in a state of sustained overload - as I write
this our 52 worker threads are valiantly handling 165 concurrent
sessions coming through haproxy. This doesn't affect our DB load
hugely: things queue in the appserver until one of the 4 threads
(current config) is idle, and only then get through to the backend.
What it does mean is that /most/ requests are competing for the GIL
for all their logic.
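To make the queueing concrete, here's a rough sketch (illustrative
only, not our actual appserver code): requests wait in a queue until
one of a small pool of worker threads is free, and the threads that
are running all contend for CPython's interpreter lock:

    import queue
    import threading

    NUM_THREADS = 4  # the per-appserver thread count mentioned above

    def handle(request):
        # page logic; CPU-bound work here contends for the GIL with
        # the other worker threads - only I/O releases it
        print("handled", request)

    work = queue.Queue()  # requests wait here until a thread is idle

    def worker():
        while True:
            handle(work.get())
            work.task_done()

    for _ in range(NUM_THREADS):
        threading.Thread(target=worker, daemon=True).start()

    for i in range(165):  # the 165 concurrent sessions all queue...
        work.put(i)
    work.join()           # ...and drain through just 4 threads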
Secondly, timeouts are a *moving* target for us. Francis chose 9
seconds as a goal for the next team meetup, and our current timeout is
14000ms - 14 seconds. We have some very stubborn pages (which we are
making progress on - bug 1 now renders in 350 queries, down from 2000
six weeks ago). We also have a number of shallow, just-not-looked-at
timeouts - e.g. Branch:+index with lots of bugs (which is now fixed).
3 of the top 4 timeouts on the 28th:
132 / 272 BugTask:+index
69 / 446 POFile:+translate
55 / 144 Distribution:+bugs
53 / 836 BranchSet:CollectionResource:#branches
have patches pending deployment right now (and 3 of the top 4 for the
1st as well).
Anyhow, the net result is that we don't have a dramatic trend line on
the *number* of timeouts.
What we *do* have is a system that is coping with a significant
increase in load and a lower timeout really quite gracefully: on the
28th we had 701 timeouts on 5,897,577 non-ops pages (the ops pages are
for nagios and haproxy and are approximately free to generate). That's
roughly 0.01% of requests being killed by a hard timeout: we're
achieving four 9's under trying conditions.
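For the arithmetic, a quick back-of-the-envelope check in Python:

    timeouts = 701
    pages = 5897577  # non-ops pages served on the 28th
    rate = timeouts / float(pages)
    print("%.4f%% hit the hard timeout" % (rate * 100))
    # -> 0.0119%: ~99.988% of requests complete, roughly four 9's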
I'm going to ask the LOSAs to drop the request timeout another second
now: this may cause some more pages that are on the edge to start
failing on their first hit (or second, if it's not a cold-cache
situation). If the spike is large, we can roll it back at a moment's
notice.
It's worth doing this because the resources we free up let us service
well-behaved pages that little bit faster: we'll gain 7% headroom for
requests per day by making this change, and that's a significant step
towards ameliorating the situation we're in.
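That 7% is, roughly, the second we're dropping as a fraction of the
current 14-second worst-case budget (my reading of the figure, not a
precise derivation):

    old_timeout = 14.0  # current hard timeout, in seconds
    new_timeout = 13.0  # after dropping another second
    print("%.1f%%" % ((old_timeout - new_timeout) / old_timeout * 100))
    # -> 7.1% of the worst-case per-request budget freed up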
Cheers,
Rob