launchpad-dev team mailing list archive

Thread
Date

Re: help needed on https://bugs.edge.launchpad.net/soyuz/+bug/590708

To: Robert Collins <robert.collins@xxxxxxxxxxxxx>
From: Danilo Šegan <danilo@xxxxxxxxxxxxx>
Date: Tue, 20 Jul 2010 15:00:23 +0200
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <AANLkTikTuwu0eko9My7Qt1uqb1EkjuBJUGkBGUI7ulD-@mail.gmail.com>
Organization: Canonical Ltd.

У уто, 20. 07 2010. у 12:54 +0200, Robert Collins пише:
> The referenced bug is causing many timeouts every day at the moment
> and the folk who have looked at it so far have no particular ideas on
> moving it forward.
> 
> I'd like to request that foundations see if they can help here, as an
> interrupt (sorry!).
> 
> There are two particular angles:
>  - the sql query being performed is very slow - 6 seconds. Can we fix
> this or do we need a model change to answer the question more
> efficiently? The query plan is doing table scans in the inner loop
> which sets of alarms for me.

I could get the fetch query to run much faster on staging, but count(*)
is only twice as fast.  I've added a comment on the bug, but here's what
I did.

I rewritten query so I could understand it first:

SELECT
  BinaryPackageBuild.*
  FROM BinaryPackageBuild
  JOIN PackageBuild
    ON BinaryPackageBuild.package_build = PackageBuild.id
  JOIN Archive
    ON Archive.id = PackageBuild.archive
  JOIN BuildFarmJob
    ON PackageBuild.build_farm_job = BuildFarmJob.id
  WHERE
    distro_arch_series IN (109, 110, 111, 112, 113, 114) AND
    (BuildFarmJob.status <> 1 OR
     BuildFarmJob.date_finished IS NOT NULL) AND
    BuildFarmJob.status=2 AND
    Archive.purpose IN (1,4) AND
    (1=1)
  ORDER BY BuildFarmJob.date_finished DESC LIMIT 50 OFFSET 0


Did an EXPLAIN ANALYZE on it and noticed that the following index scans
are an order of magnitude slower than most other index scans:

                          ->  Index Scan using
archive__distribution__purpose__key on archive  (cost=0.00..47.91
rows=29 width=4) (actual time=0.016..0.066 rows=32 loops=1)
                           ->  Index Scan using
packagebuild__archive__idx on packagebuild  (cost=0.00..387.43 rows=702
width=12) (actual time=0.006..32.616 rows=28253 loops=32)
                                 Index Cond: (packagebuild.archive =
archive.id)

So, I moved the PackageBuild archive selection into a subselect, and it
gave me a big improvement on the second re-run on staging (indexes
cached, something we should usually have on production):


SELECT
  BinaryPackageBuild.*
  FROM BinaryPackageBuild
  JOIN PackageBuild
    ON BinaryPackageBuild.package_build = PackageBuild.id
  JOIN BuildFarmJob
    ON PackageBuild.build_farm_job = BuildFarmJob.id
  WHERE
    distro_arch_series IN (109, 110, 111, 112, 113, 114) AND
    (BuildFarmJob.status <> 1 OR
     BuildFarmJob.date_finished IS NOT NULL) AND
    BuildFarmJob.status=2 AND
    PackageBuild.archive IN (
      SELECT Archive.id
        FROM PackageBuild
        JOIN Archive
          ON Archive.id = PackageBuild.archive
        WHERE Archive.purpose IN (1,4) AND
              PackageBuild.id = BinaryPackageBuild.package_build)
    AND (1=1)
  ORDER BY BuildFarmJob.date_finished DESC LIMIT 50 OFFSET 0
;

This query has a slower first run than the above query (so Postgres is
probably right in choosing the other one), but with warm caches it runs
in 200-300ms which is an order of magnitude improvement.  Knowing the
size of our production DB RAM, I'd go with it anyway.

I believe I haven't messed anything up and these two are compatible :)
If not, oh well :)

>  - we're repeating the same query because of some lazr.restful glue:
> this seems wasteful and I don't understand why its necessary. Can we
> not 'just' issue the query, and tell the client we don't know the size
> of the collection (for expensive API's). Or something.

Count(*) still takes a lot of time with this.  On warm caches, with a
subselect, I got it down to 2s (from 4s with the old query).  But, at
least the selection runs much faster so running the two combined should
be less than 3s.

Cheers,
Danilo

References

help needed on https://bugs.edge.launchpad.net/soyuz/+bug/590708
From: Robert Collins, 2010-07-20