launchpad-dev team mailing list archive

Thread
Date

Performance tuesday - polish and storm, polish and storm

To: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
From: Robert Collins <robert.collins@xxxxxxxxxxxxx>
Date: Tue, 7 Sep 2010 18:50:59 +1200
Sender: robertc@xxxxxxxxxxxxxxxxx

I hope everyone is having/has a great Tuesday. I also encourage you to
join in with me and spend Tuesday - just tuesday - really looking at
performance issues.

But enough of the sermon; did Robert do anything interesting today?
Why yes, yes he did.

The restricted librarian change to be visible on the internet is 'code
complete' - that is, there are still deployment changes to make, HTTPS
certificates to get hold of and so on. The change is guarded on the
appserver side by a feature flag: once all the bits are ready and any
rollout fallout has completed, we can switch it on, live, on the
appservers. We also still have to QA it (which needs the new certs)...
we may find, for instance, that there are tweaks to do to support it
in APIs (API clients will need to follow redirects, for instance).
This will:
 - remove all the 'high duration' times which are really internet
clients downloading soyuz logs and bug attachments from the appserver
 - as a result make the appservers less prone to fall over in a
screaming heap when the librarian glitches.

I also landed a very small tweak to requesttimeline to not use a
999999ms sentinel, so OOPS reports after todays edge rollout will be
back to normal (but with the extra bits still added).

Following on from that I put up a storm patch to let us instrument how
long it takes to retrieve data from the raw resultset in storm : this
may be unneeded, James Henstridge is very sure (has checked code sure)
that its all in-memory and essentially free.

Then, and this is a bit that may interest Danilo, I got stuck back
into Distribution:+assignments - bug 618367; this page shows 3000
specification assignments, and takes quite a while to do so. However
the query to retrieve them is only 800ms (with eager loading of
people) or 500ms (without, but the separate person query then takes
1.5 seconds... go figure).

So where goeth the time? All the gory details are in the bug, but to
save folk chasing pointers I'll summarise here.

Gary's ++profile++ decorator is very sexy. Thanks! I asked spm to
enable that on staging for a limited time, and grabbed a profile of
the page.

The profile showed 16 seconds (of 25 up to the kill-point) going into
storm object deserialisation. 30000 objects. (Thats 3000 rows, 10
objects per row).

So, I filed bugs on storm -  632140 for NULL object checks (small
saving, but probably worth doing), and bug 632145 about the overall
performance. I then went to see if doing two queries would be better.
Again, some help from spm to apply the patch and enable profiling and
we go the answer.. its a wash : it still takes 12 seconds (with
profiler, 10 without) to process the *3000* rows of Specification
objects *alone*. So in the 30K object trace, 4 seconds for 27K
objects, 12 seconds for the other 3K.

I suspect there are some constructs in the class definition that
perform poorly in storm / trigger some sort of pathology; more digging
is needed. For instance, there are 100K checkpont calls, on 3000
objects - 30 checkpoints *per object* on average. That is 2% of the
total runtime of the page, so not a big candidate to spend time on -
but its still worrying that this housekeeping, which is *entirely
uninteresting in read only transactions* is being paid for.

I speculate that ripping that code out of the main path in storm and
putting it into a separate facility used for mutation operations would
make a phenominal difference.

Now the bit that *may* interest Danilo: if this is right, when we have
high nonsql time, and lots of rows coming back, we should add storm
into the list of suspects (and get a profile, always get a profile
:)).

-Rob

Follow ups

Re: Performance tuesday - polish and storm, polish and storm
From: John Arbash Meinel, 2010-09-07