
Re: What about non-DB performance? (was Re: sql derived data, eager loading and cachedproperty)

 

On Fri, Aug 13, 2010 at 12:23 AM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
> Hi Robert,
>
> I think it's great that you are focusing on performance.  And while
> database used to be the main contention point in the past, my feeling is
> that it is not so anymore.  I.e. we know how to fix DB performance
> issues, and many remain simply because of the lack of time to go through
> all of them.  I might, of course, be mistaken, and people don't really
> have the same experience tackling performance problems like we do in
> Translations. :)

I've been largely data-driven in what I'm focusing on. For instance,
search (+filebug, answers) was a -huge- OOPS culprit a month ago, but
I'm very hopeful that the OOPS reports today will have a pleasant
surprise.

See the attached lpnet page performance report (it uses JS scripting,
so open it in a browser). This report tells us a few things. (Sort by
the 99% column; oh, and it is stale - it was generated yesterday,
before the rollout.)

Sort by 99%-complete-within-time:
Firstly, the top four pages in that report by *time* are not DB-driven.

But sort by hits:
The first one is DB-driven, the second is not; the third is, the fourth is not.

So, DB issues are certainly present in many of the problem pages that
our users exercise the most.

TranslationGroup:+index is one that may interest you - it looks like
it is partly DB, partly Python.

> The discussion you are starting is very useful, because we should
> ultimately be able to design APIs we like while not sacrificing
> performance.  Today, in Launchpad, you usually have to come up with ugly
> APIs to accommodate performance solutions you are employing.

Yeah, and if it's ugly we can't easily tell what it does, nor feel
happy about working on it and changing it.

> On Thu, 2010-08-12 at 15:34 +1200, Robert Collins wrote:
>
>> I have some long term ideas about this: broadly, to separate out
>> 'looks like python' from 'looks like it does DB access' so that its
>> obvious when db access is going to happen, and structure things so
>> that:
>>  - if it looks cheap, it is cheap - always
>>  - no database access is needed during template rendering (as a safety
>> measure for the first point : templates look cheap always)
>>  - set based, fast behaviour is easy and the default
>
> It would be very nice to come up with a sane approach for these kinds of
> things which doesn't include methods named like
> "getPOFilesAndPOTemplatesAndPersons", which is roughly the manual
> approach we are taking today (i.e. getting a ResultSet containing all
> objects we care about in one single query).  My opinion is that we could
> just start with Free's work on Storm (cool stuff, Free!), expand it so
> it works well even with cases like ubuntu_coc (though, it seems to me it
> already supports a lot of it, and it does support more than one argument
> prejoin), and just come up with a coding convention in Launchpad to
> express pre-joins in a generic way.  If that means starting with
> something simple like model API methods defined like:
>
>  def getPOTemplates(populate=[POTemplate.sourcepackagename])
>
> that's probably good enough for now (i.e. standardize on the argument
> name, whatever it is — "populate" is just an example — and always make
> it a column list or something).
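
For concreteness, a populate-style prejoin might look roughly like
this with Storm's using()/LeftJoin - a sketch only, assuming the model
classes are in scope; the attribute names and join condition are
illustrative, not our real definitions:

from storm.expr import LeftJoin

def getPOTemplates(store, populate=(SourcePackageName,)):
    if SourcePackageName in populate:
        # Eager-load the related table in the same query, so later
        # attribute access doesn't fire one SELECT per template.
        return store.using(
            POTemplate,
            LeftJoin(
                SourcePackageName,
                POTemplate.sourcepackagenameID == SourcePackageName.id),
            ).find((POTemplate, SourcePackageName))
    return store.find(POTemplate)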

So, take is_ubuntu_coc_signer. Today the query is roughly:
        query = AND(SignedCodeOfConduct.q.active==True,
                    SignedCodeOfConduct.q.ownerID==self.id,
                    SignedCodeOfConduct.q.datecreated>=lastdate)

Or better:
Exists(Select(1, And(SignedCodeOfConduct.q.active == True,
                     SignedCodeOfConduct.q.ownerID == self.id,
                     SignedCodeOfConduct.q.datecreated >= lastdate)))
(to allow embedding into larger queries).
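
For example (a sketch, not code from our tree; Storm-style column
attributes and a 'lastdate' in scope are assumed), that clause can be
dropped straight into a wider Person query so the flag is computed in
one round trip for a whole set of people:

from storm.expr import And, Exists, Select

signer_clause = Exists(Select(
    1,
    where=And(
        SignedCodeOfConduct.active == True,
        SignedCodeOfConduct.ownerID == Person.id,
        SignedCodeOfConduct.datecreated >= lastdate),
    tables=[SignedCodeOfConduct]))

recent_signers = store.find(Person, signer_clause)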

Storm ReferenceSets don't (currently) permit constraints, so while we
can write a collection for *all* signatures a person has made, we
can't tell Storm how to calculate the attribute for us. We could layer
it:
signatures = ReferenceSet(..)

@property
def is_ubuntu_coc_signer(self):
    # any() rather than bool(): a generator object is always truthy.
    return any(
        signature.active and signature.datecreated >= lastcocdate
        for signature in self.signatures)

but this has a few downsides. Firstly, while we don't expect members
to do many CoC signings, this is only an *example* of derived data in
our system; many other cases have hundreds or thousands of related,
unconstrained rows.
Secondly, it means more rows coming out of the DB, which uses
resources, and these things add up. Do that across four related tables
and the number of rows coming back can be many more than we want,
which causes memory pressure and overhead deduping them - at a minimum.
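
The set-based shape I'd like to be the easy default looks more like
the following sketch (again with assumed Storm-style attributes): one
grouped query answers the question for a whole batch of people without
streaming every signature row into Python:

from storm.expr import And, Count

def coc_signer_ids(store, person_ids, lastcocdate):
    # Answer "which of these people have signed?" with one aggregate
    # query, rather than loading every signature row per person.
    result = store.find(
        (SignedCodeOfConduct.ownerID, Count()),
        And(
            SignedCodeOfConduct.ownerID.is_in(person_ids),
            SignedCodeOfConduct.active == True,
            SignedCodeOfConduct.datecreated >= lastcocdate))
    return set(owner_id for owner_id, _ in
               result.group_by(SignedCodeOfConduct.ownerID))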

> However, even when we use the optimal DB path, we have problems with the
> rendering time itself.  Check out
> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1685EB1355
> for an example.  That is 275ms of SQL time (and no, it's not a bug in
> the OOPS tools: that is the SQL time),

Yup: 'TimeoutError' means 'no time budget left', and
'LaunchpadTimeoutError' means 'cancelled because the DB query ran over'.
It is a little confusing to new players ;).

> and ~17s of rendering time, and
> we believe fmt:url for each object is one of big culprits there (well,
> removing it cuts the rendering time to 5s :).  However, to avoid doing
> it, we'd have to do something like
> getPOTemplatesAndSourcePackageIfNeeded(), so generalizing "prejoins"
> using Storm would help here as well, but ultimately, we need to fix
> fmt:url to not take this long.

Have you got a kcachegrind profile for it? I posted instructions for
doing that in my first Performance Tuesday thread: you need LOSA
support, but it's very easy for them to do.

Doing that will give you the data to see whether it's fmt:url or
something else - which may be a little surprising; there are often
easter eggs in profile results ;).
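
(For anyone following along without the LOSA recipe: the generic
version is just cProfile plus a converter to the callgrind format.
A rough sketch, assuming the pyprof2calltree package - which may well
not be what the production setup uses:)

import cProfile
from pyprof2calltree import convert

def profile_to_kcachegrind(func, output_path='request.kgrind'):
    # Profile a callable and write a file KCachegrind can open.
    profiler = cProfile.Profile()
    profiler.runcall(func)
    convert(profiler.getstats(), output_path)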

> Granted, this is rendering ~2000 objects and doing fmt:url on each one
> of them, and we can argue that it's not a good idea (there are reasons
> why we are putting it all on one page).  We knew that DB could handle it
> just fine, but we didn't expect the Python code to be this slow (to be
> honest, initial version of this did even more fmt:urls and even used
> menu:navigation which caused the page rendering time to go into hundreds
> of seconds).

\o/

> I know there are plans to fix/look into many of these things (I've also
> talked it over with Gary during the Epic), however, I somehow feel that
> you are neglecting this entire area of Launchpad performance.

I'm sad you feel that way! The main things I've done in this area so far are:
 - the kcachegrind profiling patch (landed - thanks Maris!)
 - ++profile++ to do profiling on-demand on staging (stalled, but the
bug is filed and I'd love someone to pop up and get the last inch
closed on it)

But I'd also like to note that sometimes you *need* the profile to
tell what's going on: we had an OOPS in registry - you know the one ;)
- that was breaking ubuntu/lucid; it showed up as a pure-Python
problem (like yours) with low SQL time, yet changing the SQL to do a
count() rather than listifying the result made the problem go away.
What's up with that? We don't know - it could be:
 - count(*) was really faster (perhaps)
 - the SQL times we see don't include parsing and objectification:
they are only wire-times for the SQL data?
 - the object count was enough to trigger a GC and we were in swap?
 - ???

Anyhow, my point is that we need a holistic view: any given operation
can be slow for many, many reasons; the things I've been personally
looking at have been nearly all DB-related (the big exception being
the private librarian work).

++profile++ is pretty important, I think; with that it will be easier
to get an overview. Also, we should really make sure that webservice
calls we *make* record their times to the OOPS log, just like SQL does
(e.g. librarian calls, Google search calls, rabbit calls in the
future).
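
Even a crude timing wrapper, in the spirit of the SQL statement
tracer, would get us most of the way there. A sketch only -
'record_action' stands in for whatever the OOPS machinery actually
wants, and the usage lines are hypothetical:

import time
from contextlib import contextmanager

@contextmanager
def timed_action(record_action, category, detail):
    # Time an outbound call (librarian, search, rabbit) and hand the
    # duration to the request timeline, just as we do for SQL.
    start = time.time()
    try:
        yield
    finally:
        record_action(category, detail, time.time() - start)

# Hypothetical usage:
# with timed_action(timeline.record, 'librarian', 'GET alias 123'):
#     data = librarian.getFileByAlias(123)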

> I.e. even if we went and fixed the DB side of things for pages like
>
>  https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1685EC1691
>
> we'd still be left with >5s of generation time on the appserver.
>
> (If it was only pages like these, I would just say "let's worry about it
> later", but because of the pages like the first one, this is an actual
> problem that we hit right after we resolve all the DB issues on a single
> page)

To me, 'a page is fixed when it completes in < 1 second 99% of the
time'. That's my long-term vision for server-side render performance
in Launchpad, and it's going to require excellent DB use, efficient
and lean Python code, and judicious use of denormalisation and other
precalculation techniques. (And we can't do it by just fragmenting all
the pages into JSON snippets - that just makes the user experience
very blah.)

I think that's a vision you broadly share (or you wouldn't be
worried that I'm ignoring part of it!), so I'm sure we'll get there.

For now, the primary problem I see is that you are left guessing at
the timing data, not *knowing*, so getting a profile for it - and
making it easier to profile in future - are important to help out in
this case.

-Rob
