launchpad-dev team mailing list archive

Re: performance tuesday - timeout setting, to change or not, that is the question!

To: Robert Collins <robertc@xxxxxxxxxxxxxxxxx>
From: John Arbash Meinel <john@xxxxxxxxxxxxxxxxx>
Date: Wed, 02 Mar 2011 11:19:00 +0100
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <AANLkTikX+BX_=5mSbpkr=ycWknuui7+KkuobcA3inpWc@mail.gmail.com>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.13) Gecko/20101207 Lightning/1.0b2 Thunderbird/3.1.7

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/2/2011 3:06 AM, Robert Collins wrote:
> We're continuing to make steady progress on timeouts - there are 5
> timeout improve/fix patches in the deploy queue at the moment.
> 
> But our bug count is up? What's going on?
> 
> Firstly, we're still in a state of sustained overload - as I write
> this our 52 worker threads are valiantly handling 165 concurrent
> sessions coming through haproxy. This doesn't affect our DB load
> hugely: things queue in the appserver until one of the 4 threads
> (current config) is idle and only then get through to the backend.
> What it does mean is that /most/ requests are competing for the GIL
> for all their logic.

What is the status of moving all the appservers to single-threaded so
that you don't end up in GIL contention? Would this also include
starting up more app servers per physical box?

...

> 
> What we *do* have is a system that is coping with a significant
> increase in load and a lower timeout really quite gracefully: on the
> 28th we had 701 timeouts on 5897577 non-ops pages (the ops pages are
> for nagios and haproxy and are approximately free to generate). That's
> 0.01% of requests being killed by a hard timeout: we're achieving 4
> 9's under trying conditions.

That does seem pretty good.
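
As a quick check of that arithmetic, here is a minimal sketch that uses only the two figures quoted above (701 timeouts, 5897577 non-ops pages). The helper names are made up for illustration.

```python
import math

# Figures taken from the quoted message for the 28th.
timeouts = 701
non_ops_requests = 5897577


def failure_rate(failed, total):
    """Fraction of requests that were killed by the hard timeout."""
    return failed / total


def nines(rate):
    """Express a failure rate as a count of 'nines' (0.0001 gives 4.0)."""
    return -math.log10(rate)


rate = failure_rate(timeouts, non_ops_requests)
print(f"timeout rate: {rate:.4%}")      # roughly 0.012%
print(f"success rate: {1 - rate:.4%}")
print(f"nines:        {nines(rate):.2f}")  # a little under four
```

The rate comes out slightly above 0.01%, so "4 9's" is a fair rounding rather than an exact figure.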

> 
> I'm going to ask the losas to drop the request timeout another second
> now: this may cause some more pages that are on the edge to start
> failing on their first hit (or second if it's not a cold cache
> situation). If the spike is large, we can roll it back at a moment's
> notice.
> 
> It's worth doing this because the resources we free up allow us to
> service well-behaved pages that little bit faster: we'll gain 7%
> headroom for requests per day by making this change, and that's a
> significant step to ameliorate the situation we're in.
> 
> Cheers,
> Rob

I believe you have the general time-to-render info for all requests. As
such, can't you mostly predict the effect of dropping the hard timeout?
(How many queries are currently completing in 14s, but would not
complete in 13s?) All of these seem really far away from a 9s, or the
future 5s goal.
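
A rough sketch of the kind of prediction meant here, assuming the render times are available as a plain list of seconds. The function name and the sample durations are hypothetical, not real Launchpad data.

```python
def newly_timed_out(durations, current_timeout, proposed_timeout):
    """Count requests that finish under the current hard timeout
    but would be killed under the proposed, lower one.

    Assumption: a request taking exactly `proposed_timeout` seconds
    still completes; only strictly slower requests are counted.
    """
    return sum(
        1 for d in durations
        if proposed_timeout < d <= current_timeout
    )


# Hypothetical render times in seconds, for illustration only.
sample = [0.4, 1.2, 8.9, 12.7, 13.2, 13.9, 14.5]

extra = newly_timed_out(sample, current_timeout=14.0, proposed_timeout=13.0)
print(extra)  # 2: the 13.2s and 13.9s requests; 14.5s already times out
```

This is a static estimate. It ignores any speed-up that other requests get from the resources freed by killing slow ones earlier, so it would tend to give an upper bound on the new failures.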

Is there a better way to drive timeout fixes than forcing the hard
timeout? Or is it that you expect dropping the hard timeout will cause
the evil threads to die earlier, and thus actually speed up all the
other ones...

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1uGZQACgkQJdeBCYSNAAOhqACeM0tCIX+hpoojQQRhtk3eZ1cW
2X8AoJ3UiYQ5iVL9EGo5yh45uItcZEdq
=8rEp
-----END PGP SIGNATURE-----

Follow ups

Re: performance tuesday - timeout setting, to change or not, that is the question!
From: Robert Collins, 2011-03-02

References

performance tuesday - timeout setting, to change or not, that is the question!
From: Robert Collins, 2011-03-02