
updates to our production environment; capacity planning; request latency

 

We've talked in various threads about different server configurations
- e.g. changing the thread count - in the context of render time, OOPS
counts and so forth.

I've been a bit slack getting back to the list about this, but we've
actually experimented and had really promising results in improving
datacentre performance.

We also found that we're running *right on* capacity: requests
regularly queue up a little in the datacentre, so a big traffic spike
would significantly impair the service. However, we do serve between
5 and 8 million page renders a day (from zope alone - ignoring the
librarian, loggerhead etc.).

I'm not sure quite where to start, so I'm just going to describe the
changes we've made to our environment, and why, and the benefits we
should get from it.

The pipeline a request comes through is:
apache (SSL) -> squid (anonymous only) -> haproxy -> zope appserver
                   \------ authenticated/api-------->/

Each step in that pipeline is horizontally scalable: we have multiple
apaches, squids, haproxies and so on.

Previously, the haproxies were configured in an active-active
arrangement, where each had its own view of the load on the zope
appservers; this led to poor queueing behaviour: if one haproxy had
(say) 3 big requests land on one backend, and the other haproxy
(which already had 2 slow requests of its own on that backend) had a
new request come in, it might think that the appserver had 2 free
threads - yet that request would actually queue behind the 5 (3 from
one haproxy, 2 from the other) slow requests.
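
To make that concrete, here's a tiny Python sketch of the mismatch
(the numbers are the hypothetical ones from the example above, and
the 4-thread backend matches the old appserver configuration
described further down):

  # Each haproxy only counts the requests *it* has sent to a backend.
  BACKEND_THREADS = 4        # worker threads on one zope appserver

  haproxy_a_in_flight = 3    # three big requests dispatched by haproxy A
  haproxy_b_in_flight = 2    # two slow requests dispatched by haproxy B

  # From haproxy B's point of view the backend looks half idle...
  print("haproxy B thinks free threads: %d"
        % (BACKEND_THREADS - haproxy_b_in_flight))            # 2

  # ...but the backend's real concurrency is the sum of both views, so a
  # new request queues behind all 5 slow requests already in flight.
  actual_in_flight = haproxy_a_in_flight + haproxy_b_in_flight
  print("actually free threads: %d"
        % max(BACKEND_THREADS - actual_in_flight, 0))         # 0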

We now have the haproxies in an active-passive configuration where
only one haproxy forwards requests to the appservers at a time; this
lets that haproxy accurately assess the concurrency on the backends.
We've also upgraded to haproxy 1.4, which inspects each http request
and dispatches it to a backend, rather than just gluing the two sides
together (as 1.3 and before do). The combination has dramatically
dropped in-datacentre latency and given us efficient reuse of backend
threads.

6 months ago our zope appservers were (nearly) all configured to run
with 4 worker threads and 16 concurrent sessions (8 from each
haproxy). We had a lot of data suggesting that requests were
CPU-starved when the appserver had multiple concurrent requests. For
instance, we saw gaps of multiple seconds between successive queries
in trivial code on the internal xmlrpc server - which was running at
max capacity all the time before we incorporated that service into
the main zope appserver cluster.

We are now migrating all our appservers to run with 2 worker threads
and 1 concurrent http request (note the change from session to
request due to the haproxy upgrade). The appservers that have been
migrated are delivering fewer OOPSes on a day-to-day basis, even
allowing for the reduced thread count. We keep 2 threads so that
+opstats and other monitoring requests can be serviced in a timely
fashion even if a big SQL query is delaying the main request. Some of
the appservers will also serve internal xmlrpc requests on that
second thread - but those will likewise be limited to one concurrent
request at a time.

This means we're going to be running around 4 times as many zope
appserver instances as we were: rather than 4 threads per appserver
and one CPU getting used, we'll have 4 appservers which can
potentially use 4 CPUs at once. The page performance report shows
that about half of our request time is spent in SQL at the moment, so
we can fairly safely run 2 appserver instances per CPU. The increased
efficiency means we should be able to run fewer worker threads than
we are at the moment, but we don't know *how many* fewer. We also
need some headroom: we should be running with enough excess capacity
that we can take a machine offline to repair it without impacting the
service.

Each of the new-config appserver instances spikes up to ~800MB of
RAM, so we've come up with a budget of 1GB of RAM per instance.
Currently we have a spread of very old machines through to some quite
new ones running the appservers. We're replacing the older machines
with hardware equivalent to our newer machines; we're going to end up
with 4 appserver machines, each with 24GB of RAM and 8 CPUs. That
will let us run 15 appserver instances safely per machine (based on
current sql/non-sql time ratios); 3 machines will give us 45
instances serving traffic (we want to have N+1 capacity). We do
between 70 and 90 renders per second on this cluster, with a mean of
0.33 seconds per request, so we should be safe with this
configuration. If we aren't, the hardware can be further upgraded to
let us run more appserver instances on each machine - but we need to
look at the actual utilisation and load once we're reconfigured: if
we don't need extra capacity right away, there is no point spending
the money initially.
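
As a rough sanity check on those figures, here's a Little's-law style
sketch (my own arithmetic, using the numbers quoted above; the 45
instances assume one of the 4 machines is held back for N+1):

  # concurrency ~= arrival rate * mean service time (Little's law)
  mean_render_time = 0.33                # seconds per request
  instances = 15 * 3                     # 45 instances, 1 main request each

  for renders_per_second in (70, 90):
      in_flight = renders_per_second * mean_render_time
      print("%d req/s -> ~%.0f requests in flight (%.0f%% of %d instances)"
            % (renders_per_second, in_flight,
               100.0 * in_flight / instances, instances))

  # 70 req/s -> ~23 requests in flight (51% of 45 instances)
  # 90 req/s -> ~30 requests in flight (66% of 45 instances)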

Fully loaded, this new configuration will be able - at peak - to run
60 worker threads, with 30 concurrent DB requests; our master DB
server has 16 CPUs, and the two slaves have 8 CPUs each: we still
have plenty of headroom in the DB based on the database performance
report, but we're removing a bottleneck with the appserver
reconfiguration. Stuart and I will be looking closely at the load on
the DB server(s) as the appserver reconfiguration happens.
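
A similarly rough sketch of the database side (again my own
arithmetic from the figures above, not the database performance
report itself):

  # With roughly half of request time spent in SQL, 60 busy worker
  # threads imply about 30 queries in flight at any instant.
  peak_worker_threads = 60
  sql_fraction = 0.5
  concurrent_db_requests = int(peak_worker_threads * sql_fraction)  # 30

  db_cpus = 16 + 8 + 8       # master plus the two slaves
  print("concurrent DB requests at peak: %d" % concurrent_db_requests)
  print("DB CPUs across the cluster: %d" % db_cpus)
  # Even if every query were CPU-bound, ~30 concurrent queries still fit
  # within ~32 CPUs, which is consistent with the headroom claim above.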

When this project is finished, we should have nearly no in-datacentre
queueing: render time plus transfer time should account for the
entirety of any request (and the render time shown in LP pages should
thus correlate pretty well with perceived time, in the absence of
javascript causing browser-render issues). Right now we have low
queue times most of the time, with occasional queueing. We can also
expect high-CPU pages to have even less impact on, and interaction
with, other requests.

Finally, IS are working on improving our SSL configuration so that
the SSL session cache is shared between the Apache front ends, which
will reduce browser rehandshaking for roaming users.

I hope this has been useful and relevant - knowing how our code is
deployed can make a big difference to the guesses and theories we
have to explain timeouts, OOPSes and the general perceived
performance of Launchpad.

I'd be delighted to fill in any blanks I missed :). My huge thanks go
out to the Ubuntu server team and IS (projects, GSA & LOSA) for all
the effort they have put in to bring this about.
- Rob