← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~sinzui/launchpad/fast-prf into lp:launchpad

 

Curtis Hovey has proposed merging lp:~sinzui/launchpad/fast-prf into lp:launchpad.

Commit message:
Do fewer db queries to make the product release finder fast.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  Bug #1081214 in Launchpad itself: "product-release-finder is slowed by duplicate queries"
  https://bugs.launchpad.net/launchpad/+bug/1081214

For more details, see:
https://code.launchpad.net/~sinzui/launchpad/fast-prf/+merge/135451

The product release finder script works with strings. It looks up
projects, series, releases, etc that match the string. The script
attempts to minimise the number of objects it works with, but the
approach actually makes more queries than are necessary because the PRF
will always walk a directory. While each check of a matched file is
economical, the loop over all files found for a release file glob will
lookup the project, series, milestone and release multiple times to
discover that files was already uploaded.

--------------------------------------------------------------------

RULES

    Pre-implementation: no one
    * The PRF creates about 1 release per run, so we can expect it
      to do a handful of extra db calls while creating the needed objects
      for one product release each run, between 3 and 5 extra queries.
      We want:
      * 1 query to start the PRF
      * 1 query per project to get the needed information to check if a
        found release needs to be uploaded.
      * 3-5 queries to create the upload
    * Refactor ProductReleaseFinder.getFilters to get all the data in
      a single query. the PRF wants strings, so there is no need to
      populate the Storm object cache.
    * Update handleProduct to get a set of all the known release files
      and pass it to the methods that need it so that checking for
      existing files is cheap and fast.

    ADDENDUM
    * The file pointer is not explicit closed in handleRelease(). This
      can be a problem since addReleaseTarball() can raise an exception,
      which will circumvent the rule to automatically close the file
      pointer when the identifier leaves scope.
      * This may have contributed to performance issues in the past.

QA

    * This is untestable on qastaging, but a dev instance can be used
      to crawl an example glob from production.
    * Wow! the dev instance is much faster. I tested using the Rhythmbox
      release file glob. Pulling down the tarballs was about twice
      as fast. The second run which had no work to do was about 8 times
      faster.


LINT

    lib/lp/registry/scripts/productreleasefinder/finder.py
    lib/lp/registry/tests/test_prf_finder.py
    lib/lp/testing/factory.py


LoC:

    I have about 3,000 lines of credit this week.


TEST

    ./bin/test -vv lp.registry.tests.test_prf_finder


IMPLEMENTATION

I refactored getFilters() to query just for the needed strings. This
reduced the startup time and does not require the storm object cache to
be populated. I then added getReleaseFileNames() to get all the known
file names for the project. Again this is string data so the storm
object cache is not involved. I updated handleProduct() to call
getReleaseFileNames(), then pass it to handleRelease(). The signature
change was an invasive change to several methods and tests :/. I placed
the file pointer code into a try-finally block to ensure it is closed if
an exception is (and has been) raised.
    lib/lp/registry/scripts/productreleasefinder/finder.py
    lib/lp/registry/tests/test_prf_finder.py

I updated the factory to allow me to create distict release file names,
which Lp requires.
    lib/lp/testing/factory.py
-- 
https://code.launchpad.net/~sinzui/launchpad/fast-prf/+merge/135451
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~sinzui/launchpad/fast-prf into lp:launchpad.
=== modified file 'lib/lp/registry/scripts/productreleasefinder/finder.py'
--- lib/lp/registry/scripts/productreleasefinder/finder.py	2012-11-16 21:16:03 +0000
+++ lib/lp/registry/scripts/productreleasefinder/finder.py	2012-11-21 15:47:28 +0000
@@ -7,6 +7,7 @@
     'ProductReleaseFinder'
     ]
 
+from collections import defaultdict
 from datetime import datetime
 import mimetypes
 import os
@@ -22,8 +23,17 @@
 from lp.app.validators.version import sane_version
 from lp.registry.interfaces.product import IProductSet
 from lp.registry.interfaces.series import SeriesStatus
+from lp.registry.model.milestone import Milestone
+from lp.registry.model.product import Product
+from lp.registry.model.productseries import ProductSeries
+from lp.registry.model.productrelease import (
+    ProductRelease,
+    ProductReleaseFile,
+    )
 from lp.registry.scripts.productreleasefinder.filter import FilterPattern
 from lp.registry.scripts.productreleasefinder.hose import Hose
+from lp.services.database.lpstorm import IStore
+from lp.services.librarian.model import LibraryFileAlias
 
 
 processors = '|'.join([
@@ -99,40 +109,37 @@
         Returns a list of (product_name, filters) for each product in
         the database, where the filter keys are series names.
         """
-        todo = []
-
         self.ztm.begin()
-        products = getUtility(IProductSet)
-        for product in products.get_all_active(None, eager_load=False):
-            filters = []
-
-            for series in product.series:
-                if (series.status == SeriesStatus.OBSOLETE
-                    or not series.releasefileglob):
-                    continue
-
-                filters.append(FilterPattern(series.name,
-                                             series.releasefileglob))
-
-            if not len(filters):
-                continue
-
-            self.log.info("%s has %d series with information", product.name,
-                             len(filters))
-
-            todo.append((product.name, filters))
+        found_globs = IStore(Product).find(
+            (Product.name, ProductSeries.name, ProductSeries.releasefileglob),
+            Product.id == ProductSeries.productID,
+            Product.active == True,
+            ProductSeries.status != SeriesStatus.OBSOLETE,
+            ProductSeries.releasefileglob != None
+            )
+        products_with_filters = defaultdict(list)
+        last_product = None
+        for product_name, series_name, glob in found_globs:
+            if last_product and last_product != product_name:
+                self.log.info(
+                    "%s has %d series with information",
+                    last_product, len(products_with_filters[last_product]))
+            last_product = product_name
+            filter_pattern = FilterPattern(series_name, glob)
+            products_with_filters[product_name].append(filter_pattern)
         self.ztm.abort()
-
-        return todo
+        return products_with_filters.items()
 
     def handleProduct(self, product_name, filters):
         """Scan for tarballs and create ProductReleases for the given product.
         """
+        file_names = self.getReleaseFileNames(product_name)
         hose = Hose(filters, log_parent=self.log)
         for series_name, url in hose:
             if series_name is not None:
                 try:
-                    self.handleRelease(product_name, series_name, url)
+                    self.handleRelease(
+                        product_name, series_name, url, file_names)
                 except (KeyboardInterrupt, SystemExit):
                     raise
                 except:
@@ -143,28 +150,29 @@
                 self.log.debug("File in %s found that matched no glob: %s",
                                product_name, url)
 
-    def hasReleaseFile(self, product_name, series_name,
-                          release_name, filename):
+    def getReleaseFileNames(self, product_name):
+        """Return a set of all current release file names for the product."""
+        self.ztm.begin()
+        found_names = IStore(Product).find(
+            (LibraryFileAlias.filename),
+            Product.name == product_name,
+            Product.id == ProductSeries.productID,
+            Milestone.productseriesID == ProductSeries.id,
+            ProductRelease.milestoneID == Milestone.id,
+            ProductReleaseFile.productreleaseID == ProductRelease.id,
+            LibraryFileAlias.id == ProductReleaseFile.libraryfileID
+            )
+        file_names = set(name for name in found_names)
+        self.ztm.abort()
+        return file_names
+
+    def hasReleaseFile(self, filename, file_names):
         """Return True if we have a tarball for the given product release."""
-        has_file = False
-        self.ztm.begin()
-        try:
-            product = getUtility(IProductSet).getByName(product_name)
-            if product is not None:
-                release = product.getRelease(release_name)
-                if release is not None:
-                    for fileinfo in release.files:
-                        if filename == fileinfo.libraryfile.filename:
-                            has_file = True
-                            break
-        finally:
-            self.ztm.abort()
-        return has_file
+        return filename in file_names
 
     def addReleaseTarball(self, product_name, series_name, release_name,
                           filename, size, file, content_type):
         """Create a ProductRelease (if needed), and attach tarball"""
-        # Get the series.
         self.ztm.begin()
         try:
             product = getUtility(IProductSet).getByName(product_name)
@@ -194,7 +202,7 @@
             self.ztm.abort()
             raise
 
-    def handleRelease(self, product_name, series_name, url):
+    def handleRelease(self, product_name, series_name, url, file_names):
         """If the given URL looks like a release tarball, download it
         and create a corresponding ProductRelease."""
         filename = urlparse.urlsplit(url)[2]
@@ -213,7 +221,7 @@
                            version, url)
             return
 
-        if self.hasReleaseFile(product_name, series_name, version, filename):
+        if self.hasReleaseFile(filename, file_names):
             self.log.debug("Already have a tarball for release %s", version)
             return
 
@@ -233,7 +241,11 @@
             self.log.error("Unable to stat downloaded file")
             raise
 
-        fp = open(local, 'r')
-        os.unlink(local)
-        self.addReleaseTarball(product_name, series_name, version,
-                               filename, stat.st_size, fp, mimetype)
+        try:
+            fp = open(local, 'r')
+            os.unlink(local)
+            self.addReleaseTarball(product_name, series_name, version,
+                                   filename, stat.st_size, fp, mimetype)
+            file_names.add(filename)
+        finally:
+            fp.close()

=== modified file 'lib/lp/registry/tests/test_prf_finder.py'
--- lib/lp/registry/tests/test_prf_finder.py	2012-11-16 21:04:35 +0000
+++ lib/lp/registry/tests/test_prf_finder.py	2012-11-21 15:47:28 +0000
@@ -6,6 +6,7 @@
 import shutil
 from StringIO import StringIO
 import tempfile
+import transaction
 import unittest
 
 from zope.component import getUtility
@@ -69,6 +70,23 @@
         # Test that this raises no exceptions.
         prf.findReleases()
 
+    def test_getReleaseFileNames(self):
+        product = self.factory.makeProduct()
+        series1 = self.factory.makeProductSeries(product=product)
+        series2 = self.factory.makeProductSeries(product=product)
+        self.factory.makeProductReleaseFile(
+            productseries=series1, filename='foo-1.0.tar.gz')
+        file2 = self.factory.makeProductReleaseFile(
+            productseries=series2, filename='foo-2.0.tar.gz')
+        self.factory.makeProductReleaseFile(
+            productseries=series2, release=file2.productrelease,
+            filename='foo-2.1.tar.gz')
+        expected = set(['foo-1.0.tar.gz', 'foo-2.0.tar.gz', 'foo-2.1.tar.gz'])
+        transaction.commit()
+        prf = ProductReleaseFinder(self.layer.txn, logging.getLogger())
+        found = prf.getReleaseFileNames(product.name)
+        self.assertEqual(expected, found)
+
 
 class GetFiltersTestCase(TestCaseWithFactory):
 
@@ -147,7 +165,11 @@
                 ProductReleaseFinder.__init__(self, ztm, log)
                 self.seen_releases = []
 
-            def handleRelease(self, product_name, series_name, url):
+            def getReleaseFileNames(self, product_name):
+                return set()
+
+            def handleRelease(self, product_name, series_name, url,
+                              file_name):
                 self.seen_releases.append((product_name, series_name,
                                            os.path.basename(url)))
 
@@ -220,21 +242,13 @@
         alt_file_name = 'evolution-42.0.orig.tar.bz2'
         file_path, file_name = self.create_tarball(
             'evolution-42.0.orig.tar.gz')
-
-        self.assertEqual(
-            prf.hasReleaseFile('evolution', 'trunk', '42.0', file_name),
-            False)
-        self.assertEqual(
-            prf.hasReleaseFile('evolution', 'trunk', '42.0', alt_file_name),
-            False)
-
-        prf.handleRelease('evolution', 'trunk', file_path)
-
-        self.assertEqual(
-            prf.hasReleaseFile('evolution', 'trunk', '42.0', file_name),
+        file_names = set()
+        prf.handleRelease('evolution', 'trunk', file_path, file_names)
+        self.assertEqual(
+            prf.hasReleaseFile(file_name, file_names),
             True)
         self.assertEqual(
-            prf.hasReleaseFile('evolution', 'trunk', '42.0', alt_file_name),
+            prf.hasReleaseFile(alt_file_name, file_names),
             False)
 
         # check to see that the release has been created
@@ -278,7 +292,8 @@
         logging.basicConfig(level=logging.CRITICAL)
         prf = ProductReleaseFinder(ztm, logging.getLogger())
         file_path, file_name = self.create_tarball('evolution-2.1.6.tar.gz')
-        prf.handleRelease('evolution', 'trunk', file_path)
+        file_names = prf.getReleaseFileNames('evolution')
+        prf.handleRelease('evolution', 'trunk', file_path, file_names)
 
         # verify that we now have files attached to the release:
         evo = getUtility(IProductSet).getByName('evolution')
@@ -294,8 +309,9 @@
         logging.basicConfig(level=logging.CRITICAL)
         prf = ProductReleaseFinder(ztm, logging.getLogger())
         file_path, file_name = self.create_tarball('evolution-42.0.tar.gz')
-        prf.handleRelease('evolution', 'trunk', file_path)
-        prf.handleRelease('evolution', 'trunk', file_path)
+        file_names = prf.getReleaseFileNames('evolution')
+        prf.handleRelease('evolution', 'trunk', file_path, file_names)
+        prf.handleRelease('evolution', 'trunk', file_path, file_names)
         evo = getUtility(IProductSet).getByName('evolution')
         trunk = evo.getSeries('trunk')
         release = trunk.getRelease('42.0')
@@ -309,9 +325,10 @@
         logging.basicConfig(level=logging.CRITICAL)
         prf = ProductReleaseFinder(ztm, logging.getLogger())
         file_path, file_name = self.create_tarball('evolution-1.2.3.tar.gz')
-        prf.handleRelease('evolution', 'trunk', file_path)
+        file_names = prf.getReleaseFileNames('evolution')
+        prf.handleRelease('evolution', 'trunk', file_path, file_names)
         file_path, file_name = self.create_tarball('evolution-1.2.3.tar.gz')
-        prf.handleRelease('evolution', '1.0', file_path)
+        prf.handleRelease('evolution', '1.0', file_path, file_names)
         product = getUtility(IProductSet).getByName('evolution')
         release = product.getMilestone('1.2.3').product_release
         self.assertEqual(release.files.count(), 1)
@@ -324,8 +341,9 @@
         file_path, file_name = self.create_tarball('evolution-45.0.tar.gz')
         alt_file_path, alt_file_name = self.create_tarball(
             'evolution-45.0.tar.bz2')
-        prf.handleRelease('evolution', 'trunk', file_path)
-        prf.handleRelease('evolution', 'trunk', alt_file_path)
+        file_names = prf.getReleaseFileNames('evolution')
+        prf.handleRelease('evolution', 'trunk', file_path, file_names)
+        prf.handleRelease('evolution', 'trunk', alt_file_path, file_names)
         evo = getUtility(IProductSet).getByName('evolution')
         trunk = evo.getSeries('trunk')
         release = trunk.getRelease('45.0')
@@ -352,7 +370,8 @@
         fp.close()
 
         url = self.release_url + '/evolution420.tar.gz'
-        prf.handleRelease('evolution', 'trunk', url)
+        file_names = prf.getReleaseFileNames('evolution')
+        prf.handleRelease('evolution', 'trunk', url, file_names)
         self.assertEqual(
             "Unable to parse version from %s\n" % url, output.getvalue())
 

=== modified file 'lib/lp/testing/factory.py'
--- lib/lp/testing/factory.py	2012-11-19 20:33:51 +0000
+++ lib/lp/testing/factory.py	2012-11-21 15:47:28 +0000
@@ -940,11 +940,12 @@
                                product=None, productseries=None,
                                milestone=None,
                                release=None,
-                               description="test file"):
+                               description="test file",
+                               filename='test.txt'):
         signature_filename = None
         signature_content = None
         if signed:
-            signature_filename = 'test.txt.asc'
+            signature_filename = '%s.asc' % filename
             signature_content = '123'
         if release is None:
             release = self.makeProductRelease(product=product,
@@ -952,7 +953,7 @@
                                               milestone=milestone)
         with person_logged_in(release.milestone.product.owner):
             release_file = release.addReleaseFile(
-                'test.txt', 'test', 'text/plain',
+                filename, 'test', 'text/plain',
                 uploader=release.milestone.product.owner,
                 signature_filename=signature_filename,
                 signature_content=signature_content,


Follow ups