launchpad-dev team mailing list archive
-
launchpad-dev team
-
Mailing list archive
-
Message #04186
Performance tuesday! : Re: Edge Errors for 2010-08-08
As part of performance day, I wanted to pick something new to work on,
and the oops reports (which sadly are not able to be made public as
yet - they could disclose personal data) provide a good basis to do
that. So I analysed the timeouts from the most recent edge report I
had handy. Edge because it has most of our fixes.
I've elided non performance, and personal data, from this.
A few over-view thoughts:
- the misgrouping of timeout OOPS makes this *really really hard*. I
wonder if turning it off entirely would help, and just group by urls ?
Nearly all of the OOPS's I looked at were linked to an irrelevant bug.
- getting ++profile++ in place on staging will really help for
analysing the non-sql-slowness we're seeing on a growing fraction of
OOPS.
- Many were death-by-sql-roundtrips
- some were python
- a couple look like db-tuning exercises
- some were just plain weird
I'll happily dig out detail from OOPS if anyone wants to look into a
particular timeout.
On Mon, Aug 9, 2010 at 12:55 PM, OOPS Summaries <matsubara@xxxxxxxxxxxxx> wrote:
> * 64 Time Outs
This is great!
> === Top 10 Time Out Counts by Page ID ===
>
> Hard / Soft Page ID
> 13 / 44 BugTask:+index
> 12 / 6 DistroSeries:+index
> 8 / 1974 DistroSeries:EntryResource:getBuildRecords
> 8 / 160 Person:+participation
> 7 / 4 BugTask:+editstatus-page
> 3 / 77 Sprint:+temp-meeting-export
> 3 / 4 DistributionSourcePackage:+filebug
> 3 / 3 Person:+map
> 2 / 0 BugTask:+distrotask
> 1 / 0 Milestone:+index
> === Top 15 Time Outs (total of 25 unique items) ===
>
> 10 SELECT Packaging.datecreated, Packaging.distroseries, Packaging.id, Packaging.owner, Packaging.pa ... ductSeries.product = Product.id WHERE Packaging.distroseries = %s ORDER BY SourcePackageName.name:
> GET: 10 Robots: 1 Local: 5
> 5 https://edge.launchpad.net/ubuntu/lucid/+index (DistroSeries:+index)
> OOPS-1681EB2995, OOPS-1681EB3871, OOPS-1681EB3901, OOPS-1681EC3753, OOPS-1681ED3039
> 5 https://edge.launchpad.net/ubuntu/maverick/+index (DistroSeries:+index)
> OOPS-1681EB2638, OOPS-1681EC2433, OOPS-1681ED1407, OOPS-1681ED2293, OOPS-1681ED2619
- this is https://bugs.edge.launchpad.net/launchpad-registry/+bug/612358
which is inprogress, but possibly we're getting confused signals from
the oops - the 'obvious' reading may be wrong and the portlet is
simply the first-after culprit from some slow non-sql code.
> 10 SELECT PublishedPackage.archive, PublishedPackage.binarypackagedescription, PublishedPackage.bina ... = $INT AND PublishedPackage.archive IN ($INT, $INT) ORDER BY PublishedPackage.id DESC LIMIT $INT:
> POST: 10 Robots: 0 Local: 10
> 3 https://bugs.edge.launchpad.net/ubuntu/+source/linux/+filebug/BLOBID (DistributionSourcePackage:+filebug)
> OOPS-1681EA1958, OOPS-1681EA1963, OOPS-1681EB1966
> 1 https://bugs.edge.launchpad.net/ubuntu/+bug/615153/+editstatus (BugTask:+editstatus-page)
> OOPS-1681ED4365
> 1 https://bugs.edge.launchpad.net/ubuntu/+source/firefox-3.5/+bug/614861/+editstatus (BugTask:+editstatus-page)
> OOPS-1681EB1816
> [5 other URLs]
This is fixed in db-devel - its in our inventory now and should
disappear completely midweek :). Its caused by guessPackageNames.
> 9 SELECT Person.account, Person.addressline1, Person.addressline2, Person.city, Person.country, Per ... r, Person.verbose_bugnotifications, Person.visibility FROM Person WHERE Person.id = %s LIMIT $INT:
> GET: 9 Robots: 0 Local: 4
> 2 https://bugs.edge.launchpad.net/ubuntu/+source/flashplugin-nonfree/+bug/410407/+index (BugTask:+index)
> OOPS-1681ED2737, OOPS-1681ED3789
> 2 https://edge.launchpad.net/sprints/uds-m/+temp-meeting-export (Sprint:+temp-meeting-export)
> OOPS-1681EC3472, OOPS-1681ED873
> 2 https://edge.launchpad.net/ubuntu/lucid/+index (DistroSeries:+index)
> OOPS-1681EC3318, OOPS-1681ED3079
> [3 other URLs]
This is, I think, three different oops being misgrouped. The bug page
one is likely being driven by a slow formatter - it has lots of python
time between calls to is_valid_person.
https://bugs.edge.launchpad.net/malone/+bug/607935 it does have lots
of sql calls going on too, which won't be helping.
> 5 SELECT COUNT(*) FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_se ... tus=$INT AND Archive.purpose IN ($INT,$INT) AND Archive.id = PackageBuild.archive AND ($INT=$INT):
> GET: 5 Robots: 0 Local: 0
> 5 https://api.edge.launchpad.net/1.0/ubuntu/maverick (DistroSeries:EntryResource:getBuildRecords)
> OOPS-1681EA1667, OOPS-1681EB1345, OOPS-1681EB3523, OOPS-1681EC1367, OOPS-1681ED346
-- https://lp-oops.canonical.com/oops.py/?oopsid=1681EA1667 is very
odd - 4 second timeout, there is a bug about this already - I'd love
it if stuart/gary could comment on it -
https://bugs.edge.launchpad.net/launchpad-foundations/+bug/439945
-- https://lp-oops.canonical.com/oops.py/?oopsid=1681EB1345 is more
reasonable: 10 seconds sql 4 seconds nonsql.
- APIs are involved - 4 seconds of sql time this request shouldn't
need: https://bugs.launchpad.net/soyuz/+bug/590708
- also the query mentioned in that same bug is still a problem and
needs fixing. We have two different things on one bug, may want to
split that for QA clarity.
> 4 SELECT MailingList.date_activated, MailingList.date_registered, MailingList.date_reviewed, Mailin ... gList.team = Person.id AND Person.name = $STRING AND Person.teamowner IS NOT NULL AND ($INT=$INT):
> GET: 4 Robots: 0 Local: 4
> 1 https://bugs.edge.launchpad.net/%7Ecody-somerville/+participation (Person:+participation)
> OOPS-1681ED4390
> 1 https://edge.launchpad.net/%7Elifeless/+participation (Person:+participation)
> OOPS-1681EC4106
> 1 https://edge.launchpad.net/%7Emcasadevall/+participation (Person:+participation)
> OOPS-1681EB2147
> [1 other URLs]
This is death by a thousand cuts from mailing list attribute lookups -
needs prepopulation a-la _all_members or similar, its in progress
already. https://bugs.launchpad.net/launchpad-registry/+bug/607879
> 3 SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, BinaryPackageBuild.package_b ... kageBuild.archive AND ($INT=$INT) ORDER BY BuildFarmJob.date_finished DESC LIMIT $INT OFFSET $INT:
> GET: 3 Robots: 0 Local: 0
> 3 https://api.edge.launchpad.net/1.0/ubuntu/maverick (DistroSeries:EntryResource:getBuildRecords)
> OOPS-1681EB3256, OOPS-1681EB3322, OOPS-1681EB4189
This is the same issue two sections up - grouping issue again.
> 3 SELECT PersonLocation.date_created, PersonLocation.date_last_modified, PersonLocation.id, PersonL ... isible IS TRUE AND Person.id = PersonLocation.person AND Person.teamowner IS NULL AND ($INT=$INT):
> GET: 3 Robots: 3 Local: 0
> 3 https://edge.launchpad.net/%7Elocoteams/+map (Person:+map)
> OOPS-1681EB2233, OOPS-1681EC4659, OOPS-1681EC951
These oops all seem to spend ~ 30 seconds in lala land after a 570ms
query - we need a staging profile generated for this. ++profile++
would help immensely here.
> 2 SELECT Bug.heat FROM Bug, Bugtask, DistroSeries WHERE Bugtask.bug = Bug.id AND Bugtask.distroseries = DistroSeries.id AND DistroSeries.distribution = $INT ORDER BY Bug.heat DESC LIMIT $INT:
> POST: 2 Robots: 0 Local: 2
> 2 https://bugs.edge.launchpad.net/ubuntu/+bug/614983/+distrotask (BugTask:+distrotask)
> OOPS-1681ED2225, OOPS-1681ED2288
Also misgrouped in the oops server, these are timing out due to
SELECT Bug.heat FROM Bug, Bugtask, DistroSeries WHERE Bugtask.bug =
Bug.id AND Bugtask.distroseries = DistroSeries.id AND
DistroSeries.distribution = 3 ORDER BY Bug.heat DESC LIMIT 1;
taking forever - filed a bug and tagged for Stuart to look at.
https://bugs.edge.launchpad.net/malone/+bug/615644
> 2 SELECT Person.account, Person.addressline1, Person.addressline2, Person.city, Person.country, Per ... erson.id AND TeamParticipation.person = %s AND ($INT=$INT) ORDER BY Person.datecreated LIMIT $INT:
> GET: 2 Robots: 0 Local: 2
> 1 https://bugs.edge.launchpad.net/%7Epitti/+participation (Person:+participation)
> OOPS-1681EB1999
> 1 https://edge.launchpad.net/%7Emdz/+participation (Person:+participation)
> OOPS-1681ED615
This is +participation again, see under 'misgrouping'
> 1 SELECT %s FROM (SELECT BranchRevision.branch, BranchRevision.id, BranchRevision.revision, BranchR ... OT IN ( SELECT revision FROM BranchRevision WHERE branch = $INT) LIMIT $INT) AS "_tmp" LIMIT $INT:
> GET: 1 Robots: 0 Local: 1
> 1 https://code.edge.launchpad.net/%7Ejames-w/launchpad/no-more-sampledata-2/+merge/31891/+index (BranchMergeProposal:+index)
> OOPS-1681EC3984
https://bugs.edge.launchpad.net/launchpad-code/+bug/615647 - There is
a single query that is 3ms on staging, but took 13seconds on
production, which may fall under the 'we need more diagnostics to be
able to fix these things' category.
> 1 SELECT %s FROM (SELECT BugSubscription.bug, BugSubscription.date_created, BugSubscription.id, Bug ... (SELECT id FROM Bug WHERE duplicateof = $INT) AND person = $INT LIMIT $INT) AS "_tmp" LIMIT $INT:
> GET: 1 Robots: 0 Local: 0
> 1 https://bugs.edge.launchpad.net/ubuntu/+source/arts/+bug/320915/+index (BugTask:+index)
> OOPS-1681EB4115
The OOPS is again misgrouped by oopstools. This is a death by
lots-of-sql - 800 queries! Probably very amenable to the sort of thing
my /participants branch does. The actual bug is
https://bugs.launchpad.net/bugs/607935. It also has pretty high python
time, which will need some attention to.
And I ran out of steam here with 4 groups still to analyse - but we're
down to 1-per-day, so fairly far down the 'look at it now pile'.
>From here, I think its time to pop the top bug off the timeout pile.
-Rob