
launchpad-reviewers team mailing list archive

[Merge] lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301 into lp:launchpad

 

Gavin Panella has proposed merging lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301 into lp:launchpad.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  Bug #798301 in Launchpad itself: "Time-out on +localpackagediffs"
  https://bugs.launchpad.net/launchpad/+bug/798301

For more details, see:
https://code.launchpad.net/~allenap/launchpad/localpackagediffs-time-out-bug-798301/+merge/65426

A small warning: this branch has been born in an atmosphere of
frustration with PostgreSQL and Storm, most especially Storm.

The query issued by most_recent_publications() shows pathological
performance as the number of DSDs queried against grows: increasing
the DSD count from 75 to 300 causes a tenfold increase in execution
time.

Experimentation against production with alternative queries - forcing
inner joins, using temporary tables, using WITH clauses - has yielded
similar or worse performance. The only approach that has shown promise
(and quite a lot of promise: query time down from ~6s to ~0.6s) is
breaking the query into small batches and then UNIONing them back
together, with a DISTINCT ON clause applied to the set as a whole.
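
The shape of that rewrite can be sketched in SQL (table and column
names are simplified and hypothetical here; the real query joins
through several more tables):

```sql
-- DISTINCT ON is applied to the UNION as a whole, not per batch, so
-- the most recent publication per source package name still wins.
SELECT DISTINCT ON (source_package_name_id) *
FROM (
    SELECT dsd.source_package_name_id, spph.id AS spph_id
      FROM distroseriesdifference dsd, sourcepackagepublishinghistory spph
     WHERE dsd.id IN (1, 3, 5)  -- batch 1
    UNION
    SELECT dsd.source_package_name_id, spph.id
      FROM distroseriesdifference dsd, sourcepackagepublishinghistory spph
     WHERE dsd.id IN (2, 4, 6)  -- batch 2
) AS united
ORDER BY source_package_name_id, spph_id DESC;
```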

However, Storm does not make this easy. See bug 799824 for that. Most
of the complexity in this branch is working around that annoying bug.
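
For illustration, the strided batching used in the diff below can be
sketched on its own (`strided_batches` is a hypothetical name; the
branch itself uses Python 2's xrange and / division):

```python
from itertools import islice

def strided_batches(ids, target_size=35):
    """Split ids into interleaved batches, one per stride offset.

    Batch i holds every batch_count-th id starting at offset i, so the
    batches partition the input and each holds roughly target_size or
    more ids.
    """
    batch_count = max(1, len(ids) // target_size)
    return [list(islice(ids, offset, None, batch_count))
            for offset in range(batch_count)]
```

Each batch then becomes its own IN (...) condition on
DistroSeriesDifference.id in a small sub-select, and the sub-selects
are UNIONed.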

=== modified file 'lib/lp/registry/model/distroseriesdifference.py'
--- lib/lp/registry/model/distroseriesdifference.py	2011-06-14 13:47:51 +0000
+++ lib/lp/registry/model/distroseriesdifference.py	2011-06-21 22:50:41 +0000
@@ -10,7 +10,10 @@
     ]
 
 from collections import defaultdict
-from itertools import chain
+from itertools import (
+    chain,
+    islice,
+    )
 from operator import itemgetter
 
 import apt_pkg
@@ -20,12 +23,17 @@
     )
 from lazr.enum import DBItem
 from sqlobject import StringCol
+import storm
 from storm.exceptions import NotOneError
 from storm.expr import (
+    Alias,
     And,
     Column,
     Desc,
+    Select,
+    SQLRaw,
     Table,
+    Union,
     )
 from storm.locals import (
     Int,
@@ -101,6 +109,21 @@
 from lp.soyuz.model.sourcepackagerelease import SourcePackageRelease
 
 
+class FlushingResultSet(storm.store.ResultSet):
+    """A `storm.store.ResultSet` that flushes the cache before executing.
+
+    If implicit flushes are not blocked this will flush the underlying store
+    before being iterated. The store normally flushes its own caches at the
+    right times, but when constructing `ResultSet`s by hand this behaviour is
+    bypassed.
+    """
+
+    def __iter__(self):
+        if self._store._implicit_flush_block_count == 0:
+            self._store.flush()
+        return super(FlushingResultSet, self).__iter__()
+
+
 def most_recent_publications(dsds, in_parent, statuses, match_version=False):
     """The most recent publications for the given `DistroSeriesDifference`s.
 
@@ -111,12 +134,16 @@
     :param in_parent: A boolean indicating if we should look in the parent
         series' archive instead of the derived series' archive.
     """
-    columns = (
-        DistroSeriesDifference.source_package_name_id,
-        SourcePackagePublishingHistory,
-        )
+    # XXX: GavinPanella 2011-06-21 bug=799824: Storm cannot currently do
+    # DISTINCT ON with a set expression (e.g. on a UNION). A large amount of
+    # the tomfoolery with FindSpec, SQLRaw, FlushingResultSet et al below is
+    # because it's bloody hard to get Storm to just run a query of your
+    # choosing and return a set of *loaded objects*.
+    spec = storm.store.FindSpec(
+        (DistroSeriesDifference.source_package_name_id,
+         SourcePackagePublishingHistory))
+    columns, tables = spec.get_columns_and_tables()
     conditions = And(
-        DistroSeriesDifference.id.is_in(dsd.id for dsd in dsds),
         SourcePackagePublishingHistory.archiveID == Archive.id,
         SourcePackagePublishingHistory.sourcepackagereleaseID == (
             SourcePackageRelease.id),
@@ -149,18 +176,36 @@
             conditions,
             SourcePackageRelease.version == version_column,
             )
+    # The query performance drops exponentially with number of DSDs, so we
+    # break it up into a number of smaller queries and then UNION them.
+    dsd_ids = [dsd.id for dsd in dsds]
+    batch_count = max(1, len(dsd_ids) / 35)
+    batches = (
+        islice(dsd_ids, batch, None, batch_count)
+        for batch in xrange(batch_count))
+    batch_conditions = (
+        DistroSeriesDifference.id.is_in(batch)
+        for batch in batches)
+    subselects = (
+        Select(columns, And(conditions, batch_condition))
+        for batch_condition in batch_conditions)
+    union = Union(*subselects)
+    union_alias = Alias(union)
+    union_column = lambda column: Column(column.name, union_alias)
     # The sort order is critical so that the DISTINCT ON clause selects the
     # most recent publication (i.e. the one with the highest id).
     order_by = (
-        DistroSeriesDifference.source_package_name_id,
-        Desc(SourcePackagePublishingHistory.id),
+        union_column(DistroSeriesDifference.source_package_name_id),
+        Desc(union_column(SourcePackagePublishingHistory.id)),
         )
     distinct_on = (
-        DistroSeriesDifference.source_package_name_id,
+        union_column(DistroSeriesDifference.source_package_name_id),
         )
+    select = Select(
+        SQLRaw("*"), tables=union_alias, order_by=order_by,
+        distinct=distinct_on)
     store = IStore(SourcePackagePublishingHistory)
-    return store.find(
-        columns, conditions).order_by(*order_by).config(distinct=distinct_on)
+    return FlushingResultSet(store, spec, select=select)
 
 
 def most_recent_comments(dsds):