← Back to team overview

launchpad-dev team mailing list archive

Performance tuesday! : Re: Edge Errors for 2010-08-08

 

As part of performance day, I wanted to pick something new to work on,
and the oops reports (which sadly are not able to be made public as
yet - they could disclose personal data) provide a good basis to do
that. So I analysed the timeouts from the most recent edge report I
had handy. Edge because it has most of our fixes.

I've elided non performance, and personal data, from this.

A few over-view thoughts:
 - the misgrouping of timeout OOPS makes this *really really hard*. I
wonder if turning it off entirely would help, and just group by urls ?
Nearly all of the OOPS's I looked at were linked to an irrelevant bug.
 - getting ++profile++ in place on staging will really help for
analysing the non-sql-slowness we're seeing on a growing fraction of
OOPS.

- Many were death-by-sql-roundtrips
- some were python
- a couple look like db-tuning exercises
- some were just plain weird

I'll happily dig out detail from OOPS if anyone wants to look into a
particular timeout.

On Mon, Aug 9, 2010 at 12:55 PM, OOPS Summaries <matsubara@xxxxxxxxxxxxx> wrote:

>  * 64 Time Outs

This is great!


> === Top 10 Time Out Counts by Page ID ===
>
>     Hard / Soft  Page ID
>       13 /   44  BugTask:+index
>       12 /    6  DistroSeries:+index
>        8 / 1974  DistroSeries:EntryResource:getBuildRecords
>        8 /  160  Person:+participation
>        7 /    4  BugTask:+editstatus-page
>        3 /   77  Sprint:+temp-meeting-export
>        3 /    4  DistributionSourcePackage:+filebug
>        3 /    3  Person:+map
>        2 /    0  BugTask:+distrotask
>        1 /    0  Milestone:+index

> === Top 15 Time Outs (total of 25 unique items) ===
>
>  10 SELECT Packaging.datecreated, Packaging.distroseries, Packaging.id, Packaging.owner, Packaging.pa ... ductSeries.product = Product.id WHERE Packaging.distroseries = %s ORDER BY SourcePackageName.name:
>    GET: 10 Robots: 1  Local: 5
>       5 https://edge.launchpad.net/ubuntu/lucid/+index (DistroSeries:+index)
>        OOPS-1681EB2995, OOPS-1681EB3871, OOPS-1681EB3901, OOPS-1681EC3753, OOPS-1681ED3039
>       5 https://edge.launchpad.net/ubuntu/maverick/+index (DistroSeries:+index)
>        OOPS-1681EB2638, OOPS-1681EC2433, OOPS-1681ED1407, OOPS-1681ED2293, OOPS-1681ED2619

 - this is https://bugs.edge.launchpad.net/launchpad-registry/+bug/612358
which is inprogress, but possibly we're getting confused signals from
the oops - the 'obvious' reading may be wrong and the portlet is
simply the first-after culprit from some slow non-sql code.

>  10 SELECT PublishedPackage.archive, PublishedPackage.binarypackagedescription, PublishedPackage.bina ...  = $INT AND PublishedPackage.archive IN ($INT, $INT) ORDER BY PublishedPackage.id DESC LIMIT $INT:
>    POST: 10 Robots: 0  Local: 10
>       3 https://bugs.edge.launchpad.net/ubuntu/+source/linux/+filebug/BLOBID (DistributionSourcePackage:+filebug)
>        OOPS-1681EA1958, OOPS-1681EA1963, OOPS-1681EB1966
>       1 https://bugs.edge.launchpad.net/ubuntu/+bug/615153/+editstatus (BugTask:+editstatus-page)
>        OOPS-1681ED4365
>       1 https://bugs.edge.launchpad.net/ubuntu/+source/firefox-3.5/+bug/614861/+editstatus (BugTask:+editstatus-page)
>        OOPS-1681EB1816
>    [5 other URLs]

This is fixed in db-devel - its in our inventory now and should
disappear completely midweek :). Its caused by guessPackageNames.

>   9 SELECT Person.account, Person.addressline1, Person.addressline2, Person.city, Person.country, Per ... r, Person.verbose_bugnotifications, Person.visibility FROM Person WHERE Person.id = %s LIMIT $INT:
>    GET: 9 Robots: 0  Local: 4
>       2 https://bugs.edge.launchpad.net/ubuntu/+source/flashplugin-nonfree/+bug/410407/+index (BugTask:+index)
>        OOPS-1681ED2737, OOPS-1681ED3789
>       2 https://edge.launchpad.net/sprints/uds-m/+temp-meeting-export (Sprint:+temp-meeting-export)
>        OOPS-1681EC3472, OOPS-1681ED873
>       2 https://edge.launchpad.net/ubuntu/lucid/+index (DistroSeries:+index)
>        OOPS-1681EC3318, OOPS-1681ED3079
>    [3 other URLs]

This is, I think, three different oops being misgrouped. The bug page
one is likely being driven by a slow formatter - it has lots of python
time between calls to is_valid_person.
https://bugs.edge.launchpad.net/malone/+bug/607935 it does have lots
of sql calls going on too, which won't be helping.

>   5 SELECT COUNT(*) FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_se ... tus=$INT AND Archive.purpose IN ($INT,$INT) AND Archive.id = PackageBuild.archive AND ($INT=$INT):
>    GET: 5 Robots: 0  Local: 0
>       5 https://api.edge.launchpad.net/1.0/ubuntu/maverick (DistroSeries:EntryResource:getBuildRecords)
>        OOPS-1681EA1667, OOPS-1681EB1345, OOPS-1681EB3523, OOPS-1681EC1367, OOPS-1681ED346

-- https://lp-oops.canonical.com/oops.py/?oopsid=1681EA1667 is very
odd - 4 second timeout, there is a bug about this already - I'd love
it if stuart/gary could comment on it -
https://bugs.edge.launchpad.net/launchpad-foundations/+bug/439945
 -- https://lp-oops.canonical.com/oops.py/?oopsid=1681EB1345 is more
reasonable: 10 seconds sql 4 seconds nonsql.
  -  APIs are involved - 4 seconds of sql time this request shouldn't
need: https://bugs.launchpad.net/soyuz/+bug/590708
  - also the query mentioned in that same bug is still a problem and
needs fixing. We have two different things on one bug, may want to
split that for QA clarity.

>   4 SELECT MailingList.date_activated, MailingList.date_registered, MailingList.date_reviewed, Mailin ... gList.team = Person.id AND Person.name = $STRING AND Person.teamowner IS NOT NULL AND ($INT=$INT):
>    GET: 4 Robots: 0  Local: 4
>       1 https://bugs.edge.launchpad.net/%7Ecody-somerville/+participation (Person:+participation)
>        OOPS-1681ED4390
>       1 https://edge.launchpad.net/%7Elifeless/+participation (Person:+participation)
>        OOPS-1681EC4106
>       1 https://edge.launchpad.net/%7Emcasadevall/+participation (Person:+participation)
>        OOPS-1681EB2147
>    [1 other URLs]

This is death by a thousand cuts from mailing list attribute lookups -
needs prepopulation a-la _all_members or similar, its in progress
already. https://bugs.launchpad.net/launchpad-registry/+bug/607879

>   3 SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, BinaryPackageBuild.package_b ... kageBuild.archive AND ($INT=$INT) ORDER BY BuildFarmJob.date_finished DESC LIMIT $INT OFFSET $INT:
>    GET: 3 Robots: 0  Local: 0
>       3 https://api.edge.launchpad.net/1.0/ubuntu/maverick (DistroSeries:EntryResource:getBuildRecords)
>        OOPS-1681EB3256, OOPS-1681EB3322, OOPS-1681EB4189

This is the same issue two sections up - grouping issue again.

>   3 SELECT PersonLocation.date_created, PersonLocation.date_last_modified, PersonLocation.id, PersonL ... isible IS TRUE AND Person.id = PersonLocation.person AND Person.teamowner IS NULL AND ($INT=$INT):
>    GET: 3 Robots: 3  Local: 0
>       3 https://edge.launchpad.net/%7Elocoteams/+map (Person:+map)
>        OOPS-1681EB2233, OOPS-1681EC4659, OOPS-1681EC951

These oops all seem to spend ~ 30 seconds in lala land after a 570ms
query - we need a staging profile generated for this. ++profile++
would help immensely here.

>   2 SELECT Bug.heat FROM Bug, Bugtask, DistroSeries WHERE Bugtask.bug = Bug.id AND Bugtask.distroseries = DistroSeries.id AND DistroSeries.distribution = $INT ORDER BY Bug.heat DESC LIMIT $INT:
>    POST: 2 Robots: 0  Local: 2
>       2 https://bugs.edge.launchpad.net/ubuntu/+bug/614983/+distrotask (BugTask:+distrotask)
>        OOPS-1681ED2225, OOPS-1681ED2288

Also misgrouped in the oops server, these are timing out due to
SELECT Bug.heat FROM Bug, Bugtask, DistroSeries WHERE Bugtask.bug =
Bug.id AND Bugtask.distroseries = DistroSeries.id AND
DistroSeries.distribution = 3 ORDER BY Bug.heat DESC LIMIT 1;
taking forever - filed a bug and tagged for Stuart to look at.
https://bugs.edge.launchpad.net/malone/+bug/615644

>   2 SELECT Person.account, Person.addressline1, Person.addressline2, Person.city, Person.country, Per ... erson.id AND TeamParticipation.person = %s AND ($INT=$INT) ORDER BY Person.datecreated LIMIT $INT:
>    GET: 2 Robots: 0  Local: 2
>       1 https://bugs.edge.launchpad.net/%7Epitti/+participation (Person:+participation)
>        OOPS-1681EB1999
>       1 https://edge.launchpad.net/%7Emdz/+participation (Person:+participation)
>        OOPS-1681ED615

This is +participation again, see under 'misgrouping'

>   1 SELECT %s FROM (SELECT BranchRevision.branch, BranchRevision.id, BranchRevision.revision, BranchR ... OT IN ( SELECT revision FROM BranchRevision WHERE branch = $INT) LIMIT $INT) AS "_tmp" LIMIT $INT:
>    GET: 1 Robots: 0  Local: 1
>       1 https://code.edge.launchpad.net/%7Ejames-w/launchpad/no-more-sampledata-2/+merge/31891/+index (BranchMergeProposal:+index)
>        OOPS-1681EC3984

https://bugs.edge.launchpad.net/launchpad-code/+bug/615647 - There is
a single query that is 3ms on staging, but took 13seconds on
production, which may fall under the 'we need more diagnostics to be
able to fix these things' category.

>   1 SELECT %s FROM (SELECT BugSubscription.bug, BugSubscription.date_created, BugSubscription.id, Bug ...  (SELECT id FROM Bug WHERE duplicateof = $INT) AND person = $INT LIMIT $INT) AS "_tmp" LIMIT $INT:
>    GET: 1 Robots: 0  Local: 0
>       1 https://bugs.edge.launchpad.net/ubuntu/+source/arts/+bug/320915/+index (BugTask:+index)
>        OOPS-1681EB4115

The OOPS is again misgrouped by oopstools. This is a death by
lots-of-sql - 800 queries! Probably very amenable to the sort of thing
my /participants branch does. The actual bug is
https://bugs.launchpad.net/bugs/607935. It also has pretty high python
time, which will need some attention to.

And I ran out of steam here with 4 groups still to analyse - but we're
down to 1-per-day, so fairly far down the 'look at it now pile'.

>From here, I think its time to pop the top bug off the timeout pile.

-Rob