
launchpad-dev team mailing list archive

Re: result sets and iteration... unsafe idioms - seeking thoughts on how we can avoid them


On Sun, Aug 29, 2010 at 7:32 AM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
> Heya Rob,
>
> У суб, 28. 08 2010. у 06:24 +1200, Robert Collins пише:
>> On Fri, Aug 27, 2010 at 11:38 PM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
>>
>> >> Even where we have constrained object counts, this is still repetitive
>> >> work (and I suspect there is an 'if view/THING' in the tal too, so we'll
>> >> also completely iterate these things 4 times).
>> >
>> > You avoid it in the TAL if you return a ResultSet and check for
>> > view/THING/is_empty.  I assume DecoratedResultSet would do the same.
>>
>> Well, is_empty does a *separate query*. It's not avoiding the work, it's
>> doing some of it again.
>
> That's true.  But you said "iterate".  is_empty doesn't iterate, it
> executes a much faster (on average) "limit 1" query (it's roughly as
> slow only when the result is "false", but those are much faster anyhow).
>
> I ain't saying it couldn't be done even better, but is_empty is a worthy
> improvement in itself.

Sure. I see part of my job as figuring out what constructs and
facilities we need to do things /really well/: so to me, I need to
ask 'what's the best we can do here', and then see if that's reachable.
I agree that an is_empty check plus one complete query from the db is
better than two complete queries.
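To make the trade-off concrete, here is a minimal sqlite3 sketch (the schema and names are invented, and this is not Storm's actual ResultSet code) contrasting an is_empty-style LIMIT 1 probe with the naive fetch-everything idiom:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugtask (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO bugtask (status) VALUES (?)",
                 [("New",)] * 1000)

def is_empty(conn, where):
    # An is_empty-style probe: one cheap existence check, touching at
    # most one row, roughly "SELECT 1 ... LIMIT 1".
    row = conn.execute(
        "SELECT 1 FROM bugtask WHERE %s LIMIT 1" % where).fetchone()
    return row is None

def count_by_full_fetch(conn, where):
    # The naive idiom: pull every matching row back just to inspect
    # the result set.
    return len(conn.execute(
        "SELECT * FROM bugtask WHERE %s" % where).fetchall())

print(is_empty(conn, "status = 'New'"))        # False: probe saw 1 row
print(count_by_full_fetch(conn, "status = 'New'"))  # 1000 rows transferred
```

The LIMIT 1 probe is a second round-trip, but a much cheaper one than re-materialising the whole set.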

>> 300ms is bad: its 50% more than I think we should be aiming for as the
>> *entire render time* for a page.
>> (If we have 99% of requests rendering in < 1 sec, our actual average
>> performance will be better still - and I think 200ms is about right as
>> a goal).
>
> Well, 300ms is not the cause of the problem for a page with 5s of SQL
> time, and 15s of non-SQL time
> (https://lp-oops.canonical.com/oops.py/?oopsid=1601G2848).  That was my
> point.  I think we should aim to fix those 300ms once we've gotten
> everything else sorted out first.

Definitely. I think, though, that we should not aim low: we should aim
to eliminate the 300ms waste too; if we decide, tactically, to leave
it for a bit, that's fine. But let's not just ignore it either:
sometimes it will be very easy to eliminate in the same pass.

Looking at the repeated queries, this page is (IMO) suffering from
incremental loading - 2 seconds of that SQL time appears to be LFA +
single lookups for bug subscriptions, which is odd.

Mentoring is still in there (didn't we disable it?).

But the big time jumps are here:
29.	230	3ms	launchpad-main-slave	SELECT Person.account,
Person.addressline1, Person.addressline2, Person.city, Person.country,
Person.creation_comment, Person.creation_rationale,
Person.datecreated, Person.defaultmembershipperiod,
Person.defaultrenewalperiod, Person.displayname,
Person.hide_email_addresses, Person.homepage_content, Person.icon,
Person.id, Person.logo, Person.mailing_list_auto_subscribe_policy,
Person.merged, Person.mugshot, Person.name, Person.organization,
Person.personal_standing, Person.personal_standing_reason,
Person.phone, Person.postcode, Person.province, Person.registrant,
Person.renewal_policy, Person.subscriptionpolicy,
Person.teamdescription, Person.teamowner,
Person.verbose_bugnotifications, Person.visibility FROM Person WHERE
Person.id = %s LIMIT 1
30.	3659	1ms	launchpad-main-slave	SELECT Person.account,
Person.addressline1, Person.addressline2, Person.city, Person.country,
Person.creation_comment, Person.creation_rationale,
Person.datecreated, Person.defaultmembershipperiod,
Person.defaultrenewalperiod, Person.displayname,
Person.hide_email_addresses, Person.homepage_content, Person.icon,
Person.id, Person.logo, Person.mailing_list_auto_subscribe_policy,
Person.merged, Person.mugshot, Person.name, Person.organization,
Person.personal_standing, Person.personal_standing_reason,
Person.phone, Person.postcode, Person.province, Person.registrant,
Person.renewal_policy, Person.subscriptionpolicy,
Person.teamdescription, Person.teamowner,
Person.verbose_bugnotifications, Person.visibility FROM Person WHERE
Person.id = %s LIMIT 1

348.	8717	44ms 	launchpad-main-slave	SELECT BugAttachment.bug,
BugAttachment.libraryfile, BugAttachment.id,
BugAttachment.libraryfile, BugAttachment.message, BugAttachment.title,
BugAttachment.type FROM BugAttachment, LibraryFileAlias WHERE
BugAttachment.bug = %s AND BugAttachment.libraryfile =
LibraryFileAlias.id AND NOT (LibraryFileAlias.content IS NULL) ORDER
BY BugAttachment.id

349.	15868	62ms 	launchpad-main-slave	SELECT BugTag.bug, BugTag.id,
BugTag.tag FROM BugTag WHERE BugTag.bug = %s ORDER BY BugTag.tag

And that's either very slow Python or, more likely, unallocated blocking time:
 - either DB time
https://bugs.edge.launchpad.net/launchpad-foundations/+bug/618019
 - librarian time
https://bugs.edge.launchpad.net/launchpad-foundations/+bug/228320
 - some other expensive thing we don't measure (like memcached time)

>> As for stats, that query itself is < 100ms, but the object reloading
>> is significantly more.
>
> That's what I thought with some other pages, and discovered that it's
> not really true.  Storm object serialization was very fast and removing
> it didn't help page performance (in a noticeable way, at least).

Gary has done a ++profile++ decorator now! We should have it on
staging very soon, and it will be possible to get a just-in-time
profiled page to answer 'where the time goes' properly. For instance,
I manually got a losa to do a profile of the attachments API call, and
found that message_to_canonical_url_data is 80% of the time taken by
that API call.
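For flavour, a per-request profiling decorator of that general shape might look like this sketch using the stdlib cProfile (this is NOT Launchpad's actual ++profile++ implementation; all names here are invented):

```python
import cProfile
import functools
import io
import pstats

def profiled(func):
    """Run func under cProfile and stash a report on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(func, *args, **kwargs)
        finally:
            out = io.StringIO()
            # Show the 10 most expensive frames by cumulative time,
            # which answers 'where does the time go?' for one call.
            pstats.Stats(profiler, stream=out).sort_stats(
                "cumulative").print_stats(10)
            wrapper.last_profile = out.getvalue()
    return wrapper

@profiled
def render_page():
    return sum(i * i for i in range(10000))

render_page()
print(render_page.last_profile.splitlines()[0])
```

The real thing would hang the report off the request rather than the function, but the just-in-time idea is the same: no sysadmin, no cowboyed patch.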

>> We don't currently *see* that in our OOPS traces.
>> Its also tricky to get data for because its such a hot code path,
>> Gustavo has serious concerns that instrumentation in that area will
>> savagely hurt performance.
>
> We can always cowboy stuff on staging and test directly.  A poor-man's
> instrumentation. ;)

There are serious scaling issues with that. Firstly, it cannot be done
if you decide to scratch an itch on the weekend. Secondly, it takes up
LOSA time - something we are severely bottlenecked on - to do
something innately mechanical. And lastly, cowboying stuff on there
interferes with everyone, as it's a common resource.

The ++profile++ decorator will help a lot, because it will permit ad hoc
lsprof generation on pages; while still having some negative impact,
it will at least be sysadmin-free. And it won't need patches to do
ad hoc instrumentation :)

> Also, let's suppose it's not object creation, but Librarian instead.  Do
> we have instrumentation there, and if not, could we add it and get it
> into OOPS reports as well (I know they are usually out of sync [separate
> browser request], but the big jump indicates a sync operation, and maybe
> we are talking to the actual librarian over network needlessly)?

If the appserver is talking to the librarian it can easily cause long
page times - code merge proposal pages do this, for instance. I have a
branch extending the db statement timeline to be more generic; sadly I
ran head first into the webapp.adapter mess; I think I'm going to
purge and redo without tackling the global variable insanity in the
first instance, and separately look at reducing the globals.
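A generic action timeline of the sort described here might look something like the following sketch (purely illustrative - the API and category names are invented, and this is not Launchpad's actual timeline code): every blocking backend call, not just SQL, gets recorded against the request.

```python
import time
from contextlib import contextmanager

class Timeline:
    """Record (category, detail, duration) for each backend action."""

    def __init__(self):
        self.actions = []

    @contextmanager
    def action(self, category, detail):
        start = time.monotonic()
        try:
            yield
        finally:
            self.actions.append(
                (category, detail, time.monotonic() - start))

timeline = Timeline()
# SQL time and librarian time land in the same per-request record,
# so 'unallocated blocking time' shrinks to what we truly don't measure.
with timeline.action("SQL-main-slave", "SELECT 1"):
    time.sleep(0.01)
with timeline.action("librarian", "GET /1234/foo.txt"):
    time.sleep(0.02)

for category, detail, duration in timeline.actions:
    print("%-16s %-20s %.0fms" % (category, detail, duration * 1000))
```

Attaching such a record to the OOPS report would let the trace show librarian and memcached waits alongside the SQL statements it already lists.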

>> > Anyway, I believe we don't have to worry about things like these (unless
>> > you've got timeouts which prove otherwise, of course :), as long as the
>> > number of queries is static and small.
>>
>> https://bugs.edge.launchpad.net/malone/+bug/583553 - I haven't dug
>> into the detail yet, but it is precisely that: a bug timing out with
>> lots of attachments.
>
> To me, the "big jump" indicates it's the time spent in Python instead
> (and you seem to agree).  Though, the first "big jump" (statements
> 29-30) is between two Persons being selected, and only the second one is
> after BugAttachments are selected.  I would guess that URL formatters
> can be involved here as well, and I would just test-drive a change with
> all fmt:urls removed (for persons and attachments).  Also, this
> particular OOPS never hit any of the other BugAttachment queries, so I'd
> guess that it only goes to the tal:condition and times out afterwards.
> I'd suggest adding a pdb.set_trace() in the property definition and
> stepping through it on staging (at least until OOPSes can record more
> detailed data).

So, it's doing lots of bug attachment stuff - the LFA and LC content
queries demonstrate this.

A big time jump like that, in the middle of a series of identical
queries, suggests either 'another thread held onto the GIL, so sorry',
'something normally fast went -slow-', or 'the process hit swap'.
If it were a jump between different query patterns, a theory like
'browser code decided to format some stuff' would make more sense
there to me. Remember that tal loops will pay fmt: costs between each
query batch they provoke - and the single-person lookups that are
happening are, I suspect, driven by tal iteration over something that
didn't eager load enough.

The second big time jump is between different query types, and for
that I could believe fmt: overheads, if there was a big collection
loaded before - which there was (the bug attachment lookup).

> I still suspect librarian slowness (and unresponsiveness), but I don't
> know its architecture well enough to be able to guess further. :)

I doubt it. We do need to run lsprof over this to see.

>> >> I'd love it if we can come up with some way, some convention that will
>> >> make it crystal clear when something is:
>> >>  - cheap
>> >>  - not cheap
>> >
>> > We kind of assumed that using a property meant that something was cheap,
>> > and using a method meant that it could be expensive.  That's what we
>> > tried to do in translations.
>>
>> I think that's what in-python stuff absolutely should do. However
>> there's a problem - we don't prepopulate/eager load our object graphs,
>> so broadly speaking 'none' of our attributes which provide data
>> related to other stored objects meet that criteria.
>
> The question of how many are really fast is at the same time a question
> of criteria (and the criteria seem to be changing: 1ms, even when it went
> through the database, used to be considered fast; if we are aiming for
> <200ms render time, 1ms is way too slow).

Right now, the performance step we're aiming for is 99% of all
queries, all pages, under 5 seconds, with the hard timeout set to
that.

Then we're aiming for 99% of landing pages under 1 second.

Then all pages 99% completing under 1 second.

I'm noting that having 99% complete under time X means that (assuming
normally distributed render times) you need your mean + 3*stddev <
1000ms; if your mean is 900ms, your stddev needs to be ~30ms. I bet
we're *way* noisier than that, so if we assume (say) a 100ms stddev,
we're looking at a 700ms mean render time. If we assume a 200ms
stddev, which I suspect is realistic with a tuned page, given database
load noise, then we're looking at a 400ms mean render time.
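Checking that arithmetic (note mean + 3*stddev actually covers ~99.7% of a normal distribution, so it's a conservative bound for a 99% goal):

```python
# For roughly normal render times, mean + sigmas*stddev bounds the
# tail; solve for the largest mean that still meets the goal.
def max_mean_for_goal(goal_ms, stddev_ms, sigmas=3):
    return goal_ms - sigmas * stddev_ms

print(max_mean_for_goal(1000, 30))   # 910ms mean budget at 30ms stddev
print(max_mean_for_goal(1000, 100))  # 700ms
print(max_mean_for_goal(1000, 200))  # 400ms
```

So at realistic noise levels, the mean render budget is well under half the headline timeout.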

And so yes, 100ms will definitely matter when we get to the third step
in the performance roadmap. I don't think the criteria are changing at
all: the goal (all pages, 99% completion in < 1 second) is the same -
in the epic talk I had no time to unpack what that *means* in terms of
code changes, design changes, metrics etc.

>> I hesitate to suggest that we should mass-migrate in that direction,
>> if we are going to find some way to effectively eager-load. Or perhaps
>> we should migrate, and when eager loading is working we can migrate
>> back piecemeal.
>
> Yeah, I think eager loading is the way to go.  Until we come up with a
> convention on how to do it and keep nice APIs, I think we should just do
> it in whatever hackish way we can think of.
>
> Of course, the pattern you identified in the code is bad, and we know
> how it can be improved.  I am just saying that in this particular case
> it doesn't seem like the first thing to focus on.

There are (at least) a couple of ways we make performance
improvements: one is by taking a slow page and analysing it to figure
out where the problem is, and correcting that. The other is by
analysing our code, figuring out both specific problems and problem
patterns, and then correcting those.

I totally agree that, to fix the OOPS I linked, the code snippet I
showed at the start of this thread is irrelevant.

BUT

That pattern occurs a lot; I see it all over, and we need to find a
way to avoid it, or we're going to be playing hit-the-snake with
performance problems forever.
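For anyone who hasn't seen the pattern spelled out, here is a sqlite3 sketch of it (schema and names are invented, not Launchpad's): per-object lookups driven by iteration - the N+1 idiom that shows up in the OOPS trace as a wall of identical single-row Person SELECTs - versus one eager batch query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE bug (id INTEGER PRIMARY KEY, owner INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(i, "person-%d" % i) for i in range(5)])
conn.executemany("INSERT INTO bug VALUES (?, ?)",
                 [(i, i % 5) for i in range(20)])

def owners_lazy(conn):
    # One query per bug: 1 + 20 round trips for 20 bugs.
    names = []
    for (owner_id,) in conn.execute("SELECT owner FROM bug"):
        row = conn.execute(
            "SELECT name FROM person WHERE id = ?", (owner_id,)).fetchone()
        names.append(row[0])
    return names

def owners_eager(conn):
    # Two queries total: fetch the bugs, then batch-load every owner
    # with a single IN clause.
    owner_ids = [row[0] for row in conn.execute("SELECT owner FROM bug")]
    placeholders = ",".join("?" * len(set(owner_ids)))
    people = dict(conn.execute(
        "SELECT id, name FROM person WHERE id IN (%s)" % placeholders,
        sorted(set(owner_ids))))
    return [people[i] for i in owner_ids]

assert owners_lazy(conn) == owners_eager(conn)
```

Same results, and the query count no longer scales with the collection size - which is exactly what a tal loop over an under-loaded result set fails at.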

-Rob


