launchpad-reviewers team mailing list archive
Message #04020
[Merge] lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301 into lp:launchpad
Gavin Panella has proposed merging lp:~allenap/launchpad/localpackagediffs-time-out-bug-798301 into lp:launchpad.
Requested reviews:
Launchpad code reviewers (launchpad-reviewers)
Related bugs:
Bug #798301 in Launchpad itself: "Time-out on +localpackagediffs"
https://bugs.launchpad.net/launchpad/+bug/798301
For more details, see:
https://code.launchpad.net/~allenap/launchpad/localpackagediffs-time-out-bug-798301/+merge/65426
A small warning: this branch was born in an atmosphere of
frustration with PostgreSQL and Storm, most especially Storm.
The query that most_recent_publications() issues shows pathological
performance as the number of DSDs queried against grows: raising the
count from 75 to 300 causes a tenfold increase in execution time.
Some experimentation against production with alternative queries -
forcing inner joins, using temporary tables, using WITH clauses - has
yielded similar or worse performance. The only approach that has shown
promise (and quite a lot of promise: query time down from ~6s to 0.6s)
is breaking the query into small batches and then UNIONing them back
together. A DISTINCT ON clause needs to be applied to the set as a
whole.
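In plain SQL terms, the batching approach might be sketched like this. This is an illustration only, assembled as strings rather than Storm expressions, and the table and column names (and the join condition) are illustrative assumptions rather than the branch's actual schema:

```python
def batched_union_query(dsd_ids, batch_count):
    """Assemble the batched UNION query described above.

    Table names, column names, and the join condition are illustrative
    only; the real branch builds this with Storm expressions.
    """
    # Round-robin split of the ids into batch_count batches.
    batches = [dsd_ids[k::batch_count] for k in range(batch_count)]
    subselects = [
        "(SELECT dsd.source_package_name_id, spph.*"
        " FROM distroseriesdifference AS dsd"
        " JOIN sourcepackagepublishinghistory AS spph"
        " ON spph.sourcepackagename = dsd.source_package_name_id"
        " WHERE dsd.id IN (%s))" % ", ".join(str(i) for i in batch)
        for batch in batches if batch]
    union = " UNION ".join(subselects)
    # DISTINCT ON is applied to the unioned set as a whole; the ORDER BY
    # makes the most recent publication (highest id) win per name.
    return (
        "SELECT DISTINCT ON (source_package_name_id) *"
        " FROM (%s) AS unioned"
        " ORDER BY source_package_name_id, id DESC" % union)
```

The key point the sketch captures is that each small IN-list subselect stays cheap for the planner, while the DISTINCT ON and its matching ORDER BY are applied once, outside the UNION, over the combined set.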
However, Storm does not make this easy. See bug 799824 for that. Most
of the complexity in this branch is working around that annoying bug.
=== modified file 'lib/lp/registry/model/distroseriesdifference.py'
--- lib/lp/registry/model/distroseriesdifference.py 2011-06-14 13:47:51 +0000
+++ lib/lp/registry/model/distroseriesdifference.py 2011-06-21 22:50:41 +0000
@@ -10,7 +10,10 @@
]
from collections import defaultdict
-from itertools import chain
+from itertools import (
+ chain,
+ islice,
+ )
from operator import itemgetter
import apt_pkg
@@ -20,12 +23,17 @@
)
from lazr.enum import DBItem
from sqlobject import StringCol
+import storm
from storm.exceptions import NotOneError
from storm.expr import (
+ Alias,
And,
Column,
Desc,
+ Select,
+ SQLRaw,
Table,
+ Union,
)
from storm.locals import (
Int,
@@ -101,6 +109,21 @@
from lp.soyuz.model.sourcepackagerelease import SourcePackageRelease
+class FlushingResultSet(storm.store.ResultSet):
+ """A `storm.store.ResultSet` that flushes the cache before executing.
+
+ If implicit flushes are not blocked this will flush the underlying store
+ before being iterated. The store normally flushes its own caches at the
+ right times, but when constructing `ResultSet`s by hand this behaviour is
+ bypassed.
+ """
+
+ def __iter__(self):
+ if self._store._implicit_flush_block_count == 0:
+ self._store.flush()
+ return super(FlushingResultSet, self).__iter__()
+
+
def most_recent_publications(dsds, in_parent, statuses, match_version=False):
"""The most recent publications for the given `DistroSeriesDifference`s.
@@ -111,12 +134,16 @@
:param in_parent: A boolean indicating if we should look in the parent
series' archive instead of the derived series' archive.
"""
- columns = (
- DistroSeriesDifference.source_package_name_id,
- SourcePackagePublishingHistory,
- )
+ # XXX: GavinPanella 2011-06-21 bug=799824: Storm cannot currently do
+ # DISTINCT ON with a set expression (e.g. on a UNION). A large amount of
+ # the tomfoolery with FindSpec, SQLRaw, FlushingResultSet et al below is
+ # because it's bloody hard to get Storm to just run a query of your
+ # choosing and return a set of *loaded objects*.
+ spec = storm.store.FindSpec(
+ (DistroSeriesDifference.source_package_name_id,
+ SourcePackagePublishingHistory))
+ columns, tables = spec.get_columns_and_tables()
conditions = And(
- DistroSeriesDifference.id.is_in(dsd.id for dsd in dsds),
SourcePackagePublishingHistory.archiveID == Archive.id,
SourcePackagePublishingHistory.sourcepackagereleaseID == (
SourcePackageRelease.id),
@@ -149,18 +176,36 @@
conditions,
SourcePackageRelease.version == version_column,
)
+ # The query performance drops exponentially with number of DSDs, so we
+ # break it up into a number of smaller queries and then UNION them.
+ dsd_ids = [dsd.id for dsd in dsds]
+ batch_count = max(1, len(dsd_ids) / 35)
+ batches = (
+ islice(dsd_ids, batch, None, batch_count)
+ for batch in xrange(batch_count))
+ batch_conditions = (
+ DistroSeriesDifference.id.is_in(batch)
+ for batch in batches)
+ subselects = (
+ Select(columns, And(conditions, batch_condition))
+ for batch_condition in batch_conditions)
+ union = Union(*subselects)
+ union_alias = Alias(union)
+ union_column = lambda column: Column(column.name, union_alias)
# The sort order is critical so that the DISTINCT ON clause selects the
# most recent publication (i.e. the one with the highest id).
order_by = (
- DistroSeriesDifference.source_package_name_id,
- Desc(SourcePackagePublishingHistory.id),
+ union_column(DistroSeriesDifference.source_package_name_id),
+ Desc(union_column(SourcePackagePublishingHistory.id)),
)
distinct_on = (
- DistroSeriesDifference.source_package_name_id,
+ union_column(DistroSeriesDifference.source_package_name_id),
)
+ select = Select(
+ SQLRaw("*"), tables=union_alias, order_by=order_by,
+ distinct=distinct_on)
store = IStore(SourcePackagePublishingHistory)
- return store.find(
- columns, conditions).order_by(*order_by).config(distinct=distinct_on)
+ return FlushingResultSet(store, spec, select=select)
def most_recent_comments(dsds):
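As an aside on the batching arithmetic in the hunk above: islice(ids, k, None, batch_count) takes every batch_count-th element starting at offset k, so the k batches partition the id list round-robin, each one covering it exactly once. A standalone sketch (written for Python 3, so the original's integer division `/ 35` becomes `// 35`):

```python
from itertools import islice

def round_robin_batches(ids, target_batch_size=35):
    """Split ids into round-robin batches of roughly target_batch_size.

    Mirrors the batching arithmetic in the branch: batch_count is the
    number of batches, and batch k holds every batch_count-th id
    starting at offset k.
    """
    batch_count = max(1, len(ids) // target_batch_size)
    return [
        list(islice(ids, k, None, batch_count))
        for k in range(batch_count)]
```

For 100 ids this yields two batches of 50 (the even-indexed ids and the odd-indexed ids); together they cover every id exactly once, which is what lets the UNION of the per-batch subselects stand in for the single big IN-list query.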