
launchpad-dev team mailing list archive

Re: result sets and iteration... unsafe idioms - seeking thoughts on how we can avoid them


Heya Rob,

On Sat, 2010-08-28 at 06:24 +1200, Robert Collins wrote:
> On Fri, Aug 27, 2010 at 11:38 PM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
> 
> >> Even where we have constrained object counts, this is still repetitive
> >> work (and I suspect there is a 'if view/THING in the tal too, so we'll
> >> also completely iterate these things 4 times.
> >
> > You avoid it in the TAL if you return a ResultSet and check for
> > view/THING/is_empty.  I assume DecoratedResultSet would do the same.
> 
> Well, is_empty does a *separate query*. It's not avoiding the work, it's
> doing some of it again.

That's true.  But you said "iterate".  is_empty doesn't iterate, it
executes a much faster (on average) "limit 1" query (it's only roughly
as slow when the result is "false", and those queries are much faster
anyhow).

I ain't saying it couldn't be done even better, but is_empty is a worthy
improvement in itself.
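To make the distinction concrete, here is a minimal sketch of why is_empty is so much cheaper than iterating.  These are mock classes standing in for Storm's ResultSet and the database round-trip, not Storm's real implementation:

```python
class FakeStore:
    """Stands in for the database; counts rows that travel over the wire."""
    def __init__(self, rows):
        self.rows = rows
        self.rows_fetched = 0

    def fetch(self, limit=None):
        batch = self.rows if limit is None else self.rows[:limit]
        self.rows_fetched += len(batch)
        return batch


class FakeResultSet:
    def __init__(self, store):
        self.store = store

    def is_empty(self):
        # Issues a "LIMIT 1" query: at most one row comes back.
        return len(self.store.fetch(limit=1)) == 0

    def __iter__(self):
        # Materializes the whole result set.
        return iter(self.store.fetch())


store = FakeStore(rows=list(range(1000)))
rs = FakeResultSet(store)

assert not rs.is_empty()
print(store.rows_fetched)   # 1: the emptiness check fetched a single row
list(rs)
print(store.rows_fetched)   # 1001: iterating pulled all 1000 rows on top
```

So a tal:condition on view/THING/is_empty costs one row per check, however large the underlying set is.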

> >> If we had a separate mapping layer, this would be working totally from
> >> within python, so even while a little inefficient, it would be
> >> tolerable. As we don't though, it's probably causing timeouts on bugs
> >> with lots of attachments. One thing that makes this tricky is that by
> >> returning a resultset attachments allows itself to be filtered later
> >> on - and so it can be tricky to change the definition.
> >
> > If something times out due to this query being repeated 4 times, I'd say
> > you likely have bigger problems (like this query is too slow on its
> > own).  Improving something by 4x is great, but this particular query
> > should be <100ms, imho.  300ms extra overhead is bad, but shouldn't make
> > a page timeout.
> 
> 300ms is bad: it's 50% more than I think we should be aiming for as the
> *entire render time* for a page.
> (If we have 99% of requests rendering in < 1 sec, our actual average
> performance will be better still - and I think 200ms is about right as
> a goal).

Well, 300ms is not the cause of the problem for a page with 5s of SQL
time, and 15s of non-SQL time
(https://lp-oops.canonical.com/oops.py/?oopsid=1601G2848).  That was my
point.  I think we should aim to fix those 300ms once we've gotten
everything else sorted out.

> As for stats, that query itself is < 100ms, but the object reloading
> is significantly more.

That's what I thought about some other pages too, but then discovered
it's not really true.  Storm object serialization was very fast, and
removing it didn't help page performance (in any noticeable way, at
least).

> We don't currently *see* that in our OOPS traces.
> It's also tricky to get data for because it's such a hot code path,
> Gustavo has serious concerns that instrumentation in that area will
> savagely hurt performance.

We can always cowboy stuff on staging and test directly.  A poor-man's
instrumentation. ;)

Also, let's suppose it's not object creation, but Librarian instead.  Do
we have instrumentation there, and if not, could we add it and get it
into OOPS reports as well (I know they are usually out of sync [separate
browser request], but the big jump indicates a sync operation, and maybe
we are talking to the actual librarian over the network needlessly)?
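For the cowboy-on-staging approach, a poor-man's instrumentation could be as small as a timing decorator that logs call durations for later inspection.  The decorator and the Librarian-ish function below are hypothetical names for illustration, not real Launchpad APIs:

```python
import functools
import time

call_log = []  # (name, seconds) pairs we would dump into an OOPS report


def timed(name):
    """Record the wall-clock duration of every call under `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                call_log.append((name, time.monotonic() - start))
        return wrapper
    return decorator


@timed("librarian.getFileContents")
def get_file_contents(alias_id):
    # Stand-in for a (possibly slow) network round-trip to the Librarian.
    time.sleep(0.01)
    return b"contents of alias %d" % alias_id


get_file_contents(42)
print(call_log)   # one ("librarian.getFileContents", duration) entry
```

Wrapping only the suspect entry points this way keeps the overhead off the hot code path Gustavo is worried about.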

> > Anyway, I believe we don't have to worry about things like these (unless
> > you've got timeouts which prove otherwise, of course :), as long as the
> > number of queries is static and small.
> 
> https://bugs.edge.launchpad.net/malone/+bug/583553 - I haven't dug
> into the detail yet, but it is precisely that: a bug timing out with
> lots of attachments.

To me, the "big jump" indicates it's the time spent in Python instead
(and you seem to agree).  Though, the first "big jump" (statements
29-30) is between two Persons being selected, and only the second one is
after BugAttachments are selected.  I would guess that URL formatters
can be involved here as well, and I would just test-drive a change with
all fmt:urls removed (for persons and attachments).  Also, this
particular OOPS never hit any of the other BugAttachment queries, so I'd
guess that it only goes to the tal:condition and times out afterwards.
I'd suggest adding a pdb.set_trace() in the property definition and
stepping through it on staging (at least until OOPSes can record more
detailed data).

I still suspect librarian slowness (and unresponsiveness), but I don't
know its architecture well enough to be able to guess further. :)

> >> I'd love it if we can come up with some way, some convention that will
> >> make it crystal clear when something is:
> >>  - cheap
> >>  - not cheap
> >
> > We kind of assumed that using a property meant that something was cheap,
> > and using a method meant that it could be expensive.  That's what we
> > tried to do in translations.
> 
> I think that's what in-python stuff absolutely should do. However
> there's a problem - we don't prepopulate/eager load our object graphs,
> so broadly speaking 'none' of our attributes which provide data
> related to other stored objects meet that criteria.

The question of how many are really fast is also a question of criteria
(and the criteria seem to be changing: 1ms, even when it goes through
the database, used to be considered fast; if we are aiming for a <200ms
render time, 1ms is way too slow).

> I hesitate to suggest that we should mass-migrate in that direction,
> if we are going to find some way to effectively eager-load. Or perhaps
> we should migrate, and when eager loading is working we can migrate
> back piecemeal.

Yeah, I think eager loading is the way to go.  Until we come up with a
convention on how to do it and keep nice APIs, I think we should just do
it in whatever hackish way we can think of.
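Even a hackish bulk prefetch already buys a lot.  A sketch of the idea (hypothetical helper names, with a mock store counting queries) that turns an N+1 pattern into two queries:

```python
from collections import defaultdict


class FakeStore:
    """Stands in for the database; counts queries issued."""
    def __init__(self, rows):
        self._rows = rows  # (bug_id, attachment_name) pairs
        self.queries = 0

    def attachments_for(self, bug_id):
        # The lazy, per-object query behind a naive `bug.attachments`.
        self.queries += 1
        return [name for b, name in self._rows if b == bug_id]

    def attachments_for_many(self, bug_ids):
        # One bulk query covering every bug at once.
        self.queries += 1
        grouped = defaultdict(list)
        for b, name in self._rows:
            if b in bug_ids:
                grouped[b].append(name)
        return grouped


store = FakeStore([(1, "log.txt"), (1, "core.gz"), (2, "shot.png")])

# Lazy (N+1) pattern: one query per bug.
for bug_id in (1, 2, 3):
    store.attachments_for(bug_id)
print(store.queries)   # 3

# Eager pattern: a single bulk query, then in-memory lookups.
store.queries = 0
cache = store.attachments_for_many({1, 2, 3})
for bug_id in (1, 2, 3):
    cache.get(bug_id, [])
print(store.queries)   # 1
```

Stashing the bulk result on the view (or the objects themselves) is exactly the kind of hackish-but-effective move I mean; a nicer API can come later.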

Of course, the pattern you identified in the code is bad, and we know
how it can be improved.  I am just saying that in this particular case
it doesn't seem like the first thing to focus on.

Cheers,
Danilo
