
launchpad-dev team mailing list archive

performance tuesday: rambling on efficiency

 

So, I recently got a new desktop which I'm using to kick the tires on
parallel testing for Launchpad.

I've a little bit of a story about it, but nothing specific to
performance in the web app. OTOH it may illustrate how tricky
performance is ;)

I've had a bit of a fun time with its disks - it's got plain old
spinning platters rather than SSDs :).

I got it with 2 disks, and I salvaged 2 more from my old desktop, and
put them into a raid 1+0 (which is a stripe set built out of two
mirror-sets: it can tolerate a single disk failure, any one write goes
to 2 disks, and any read can be serviced from either of two disks).

It turns out the dm-raid1 driver doesn't load-balance reads, so that
last point is more theoretical than practical in natty.
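
To make the layout concrete, here's a rough Python sketch of the
mapping - an illustration only, not the dm code, and the chunk size
and device names are made up:

    # Illustration only: a 4-disk raid 1+0 as two mirror pairs, striped.
    CHUNK_SECTORS = 128                              # assumed chunk size
    MIRROR_PAIRS = [("sda", "sdb"), ("sdc", "sdd")]  # hypothetical disks

    def backing_offset(sector):
        chunk = sector // CHUNK_SECTORS
        stripes = len(MIRROR_PAIRS)
        return (chunk // stripes) * CHUNK_SECTORS + sector % CHUNK_SECTORS

    def map_write(sector):
        # A write has to land on *both* disks of one mirror pair.
        pair = MIRROR_PAIRS[(sector // CHUNK_SECTORS) % len(MIRROR_PAIRS)]
        return [(disk, backing_offset(sector)) for disk in pair]

    def map_read(sector, pick_leg):
        # A read only needs *one* disk of the pair; pick_leg chooses which.
        pair = MIRROR_PAIRS[(sector // CHUNK_SECTORS) % len(MIRROR_PAIRS)]
        return (pick_leg(pair), backing_offset(sector))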

I did some itch-scratching on this - whipping up a patch to the
dm-raid1 driver to load-balance reads.

My first attempt at read load balancing was a complete success: I
sent each successive read request to a different disk, and iostat
clearly showed me reading from both.

It was also massively slower at sequential IO: whereas cat largefile >
/dev/null ran at 100MB/s on natty's released kernel, I was lucky to get
25MB/s from each of the two drives - a loss of approximately 50%
performance.
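
That first policy amounted to something like this (a sketch of the
idea, not the actual kernel patch):

    # Naive policy: alternate mirror legs on every read, ignoring locality.
    class RoundRobinChooser:
        def __init__(self):
            self.last = 0

        def pick_leg(self, pair):
            self.last = (self.last + 1) % len(pair)
            return pair[self.last]

Alternating legs on every request chops a sequential stream in half
across the two disks, which matters for the merging story below.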

Now, the dm- layers work by mapping requests: a request comes in, and
may get split into two (e.g. if it crosses a stripe segment). Raid 0
and raid 1 have no parity requirements, so they can just map any read
request into one or more read requests from the backing device.

So you get an IO chain:
actual request on dm device
-> one or more requests on the backing device, which are submitted
back into the top of the kernel IO stack
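
As a toy illustration of that splitting step (assuming a fixed chunk
size; the real code works on bios, not tuples):

    def split_request(sector, length, chunk_sectors=128):
        # Yield (sector, length) pieces that each fit inside one chunk.
        while length > 0:
            room = chunk_sectors - (sector % chunk_sectors)
            piece = min(room, length)
            yield (sector, piece)
            sector += piece
            length -= piece

    # A 64-sector read starting at sector 96 crosses a chunk boundary:
    print(list(split_request(96, 64)))   # -> [(96, 32), (128, 32)]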

This means that request *merging* can happen on the requests to the
backing device. This is important because something simple like 'cat
bigfile > /dev/null' will trigger the kernel readahead behaviour, and
that generates up to 1000 separate IO requests per second - running
ahead of the actual reads cat is performing. If each little IO request
were serviced separately, total performance would be very slow - we'd
be limited by the command depth of the IDE disk (e.g. 31 tagged
commands), and if the requests are small enough, they won't cover a
cylinder, so we run into rotational latency etc.
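
The idea of merging, as a back-of-the-envelope sketch (the kernel's
elevator is far more involved, and it also caps the maximum request
size, which is presumably why the merged rate mentioned below is ~50
requests/sec rather than one enormous request):

    def merge_adjacent(requests):
        # requests: sorted list of (sector, length); merge contiguous runs.
        merged = []
        for sector, length in requests:
            if merged and merged[-1][0] + merged[-1][1] == sector:
                prev_sector, prev_length = merged[-1]
                merged[-1] = (prev_sector, prev_length + length)
            else:
                merged.append((sector, length))
        return merged

    # 1000 back-to-back 4KB readahead requests collapse into one big run:
    readahead = [(i * 8, 8) for i in range(1000)]   # 8 sectors ~= 4KB
    print(len(merge_adjacent(readahead)))           # -> 1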

So, to pick up the story: my patch changed 1000 requests/sec, which
were being merged into 50 requests/sec, into 500 requests/sec per
drive, which were not merged at all; only 30 could be issued to each
drive at once, so while the rate at which IO requests were satisfied
was approximately the same, the amount of work done per second
plummeted.

I've changed the patch to track where the last submitted request
ended on each backing device, and to preferentially choose the closest
device - this retains the prior behaviour for sequential IO but starts
load-balancing random reads quite tolerably.
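
The revised heuristic is roughly this shape (again a sketch of the
idea, not the actual patch):

    class NearestChooser:
        def __init__(self):
            self.last_end = {}   # disk -> sector its last request ended at

        def pick_leg(self, pair, sector, length):
            # Prefer the leg whose previous request ended closest to this one.
            leg = min(pair, key=lambda d: abs(self.last_end.get(d, 0) - sector))
            self.last_end[leg] = sector + length
            return leg

A sequential stream keeps landing on the same leg (distance zero), so
the per-drive requests still merge, while genuinely random reads get
spread across both disks.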

Our Lazr.restful API, which exposes lots of little methods, is really
very similar to this situation, but without the merging concept: we
get lots of tiny requests which would be handled more efficiently in
batches.
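
In hypothetical client terms (the method names here are invented for
illustration - they're not real lazr.restful calls), the difference is:

    def fetch_titles_one_by_one(client, bug_ids):
        # One HTTP round trip (and one appserver request) per bug.
        return [client.get_bug(bug_id).title for bug_id in bug_ids]

    def fetch_titles_batched(client, bug_ids):
        # One round trip returns everything, amortising the fixed costs.
        return [bug.title for bug in client.get_bugs(bug_ids)]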

Now, where this all gets interesting is when you consider queue
servicing: if we do 100MB/s of IO for half the time, vs 50MB/s all the
time, then we can service more things in a given time span - as long
as we don't try to do more concurrent work. If we do try to do more
concurrent work, the efficiency drops and everything just gets slower
(which is what was happening to our python appserver processes until
our recent reconfiguration).
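
Spelled out with made-up numbers:

    # Toy numbers: 1000MB of reads arriving over a 20 second window.
    work_mb = 1000.0
    busy_fast = work_mb / 100    # 10s busy at 100MB/s - 10s of headroom
    busy_slow = work_mb / 50     # 20s busy at 50MB/s - the window is full
    print(busy_fast, busy_slow)  # 10.0 20.0: same work, very different queues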

Speaking of queuing, I've recently put a lock around our test suite's
creation of temporary databases: when we create a new database in the
test suite it takes 0.2 seconds *optimally* just in postgresql. Once we
have 5 worker threads making new databases, they can starve each other
out if the tests are fairly fast. What's worse is that if two users
try to create a database at the same time, postgresql will fail -
because the template db can only have one user connected when CREATE
DATABASE is called. And the failure is *slow* - on the order of
seconds. This massively increases the contention if we don't have a
lock around calling CREATE DATABASE, and so things snowball.
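
The fix is essentially just serialising those calls; something along
these lines (a sketch, not the actual test-harness code - the template
name and connection details are assumptions):

    import threading
    import psycopg2

    _createdb_lock = threading.Lock()

    def create_test_db(name, template="launchpad_ftest_template"):
        # Serialise CREATE DATABASE so postgres never sees two sessions
        # racing to clone the same template.
        with _createdb_lock:
            conn = psycopg2.connect(dbname="postgres")
            conn.autocommit = True   # CREATE DATABASE can't run in a transaction
            try:
                conn.cursor().execute(
                    'CREATE DATABASE "%s" TEMPLATE "%s"' % (name, template))
            finally:
                conn.close()

A threading.Lock only covers workers inside one process; separate
test-runner processes would need a file lock or similar.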

Anyhow, sorry I didn't have anything directly LP-specific this week,
but I hope this little ramble was interesting. It was interesting to
see the same basics turning up in the kernel performance for me anyhow
;)

-Rob