← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~jtv/launchpad/bug-994650-scrub-faster into lp:launchpad

 

Jeroen T. Vermeulen has proposed merging lp:~jtv/launchpad/bug-994650-scrub-faster into lp:launchpad with lp:~jtv/launchpad/bug-994650-scrub-pofiletranslator as a prerequisite.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  Bug #994650 in Launchpad itself: "POFileTranslator falls out of date"
  https://bugs.launchpad.net/launchpad/+bug/994650

For more details, see:
https://code.launchpad.net/~jtv/launchpad/bug-994650-scrub-faster/+merge/105169

The garbo job I'm building to update POFileTranslator records for us was painfully slow.  William pointed out the main reason: the query that fetched the POFiles that the outer loop iterates over.  Each time the outer loop takes little slices from that query's result, Storm re-executes the query.

So we needed to keep that list in memory, but again as William pointed out, there are rather too many POFiles to keep comfortably in memory.  This branch loads the POFiles' ids up front, and then postpones the loading of further details for as long as possible.  It does mean handing ids around for some things that were objects before

In fact, you see three levels of POFile retrieval here:

1. Setup fetches only ids, so that only the ids have to be kept in memory.

2. Per batch, the loop loads just the details needed to find out if the POFile needs updating: its POTemplate id, its Language id, and its POTMsgSet ids.

3. Any POFile that actually needs fixing (hopefully a minority) gets loaded as a proper object.

This kind of thing is never fast enough, so there are obvious optimizations that I may pursue next: cache the POTMsgSet ids as we iterate over POFiles of the same POTemplate, to help the no-change case even more.  And to speed up the case where there are changes, such as on the initial run, first gather those POFiles in the batch that need fixing, and batch-load them.  The batch-loading can also cover templates, languages, distroseries, distributions, productseries, and products.  Most of these are needed only for the purpose of logging which POFile is being updated.

To test:
{{{
# Get database schema set up for this branch.
bzr branch lp:~jtv/launchpad/db-994410
cd db-994410
make schema

cd ..

# Get the prerequisite branch.
bzr branch lp:~jtv/launchpad/bug-994650-scrub-pofiletranslator jtv-scrubber
cd jtv-scrubber

# Merge in the branch under review here.
bzr merge lp:~jtv/launchpad/bug-994650-scrub-faster

# Test.
./bin/test -vvc lp.translations.scripts.tests.test_scrub_pofiletranslator
}}}


Jeroen
-- 
https://code.launchpad.net/~jtv/launchpad/bug-994650-scrub-faster/+merge/105169
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~jtv/launchpad/bug-994650-scrub-faster into lp:launchpad.
=== modified file 'lib/lp/translations/scripts/scrub_pofiletranslator.py'
--- lib/lp/translations/scripts/scrub_pofiletranslator.py	2012-05-09 06:48:26 +0000
+++ lib/lp/translations/scripts/scrub_pofiletranslator.py	2012-05-09 06:48:28 +0000
@@ -25,8 +25,8 @@
     )
 
 
-def get_pofiles():
-    """Retrieve POFiles to scrub.
+def get_pofile_ids():
+    """Retrieve ids of POFiles to scrub.
 
     The result's ordering is aimed at maximizing cache effectiveness:
     by POTemplate name for locality of shared POTMsgSets, and by language
@@ -34,14 +34,55 @@
     """
     store = IStore(POFile)
     query = store.find(
-        POFile,
+        POFile.id,
         POFile.potemplateID == POTemplate.id,
         POTemplate.iscurrent == True)
     return query.order_by(POTemplate.name, POFile.languageID)
 
 
-def get_contributions(pofile):
-    """Map all users' most recent contributions to `pofile`.
+def get_pofile_details(pofile_ids):
+    """Retrieve relevant parts of `POFile`s with given ids.
+
+    :param pofile_ids: Iterable of `POFile` ids.
+    :return: Dict mapping each id in `pofile_ids` to a duple of
+        `POTemplate` id and `Language` id for the associated `POFile`.
+    """
+    store = IStore(POFile)
+    rows = store.find(
+        (POFile.id, POFile.potemplateID, POFile.languageID),
+        POFile.id.is_in(pofile_ids))
+    return dict((row[0], row[1:]) for row in rows)
+
+
+def get_potmsgset_ids(potemplate_id):
+    """Get the ids for each current `POTMsgSet` in a `POTemplate`."""
+    store = IStore(POTemplate)
+    return store.find(
+        TranslationTemplateItem.potmsgsetID,
+        TranslationTemplateItem.potemplateID == potemplate_id,
+        TranslationTemplateItem.sequence > 0)
+
+
+def summarize_contributors(potemplate_id, language_id, potmsgset_ids):
+    """Return the set of ids of persons who contributed to a `POFile`.
+
+    This is a limited version of `get_contributions` that is easier to
+    compute.
+    """
+    store = IStore(POFile)
+    contribs = store.find(
+        TranslationMessage.submitterID,
+        TranslationMessage.potmsgsetID.is_in(potmsgset_ids),
+        TranslationMessage.languageID == language_id,
+        TranslationMessage.msgstr0 != None,
+        Coalesce(TranslationMessage.potemplateID, potemplate_id) ==
+            potemplate_id)
+    contribs.config(distinct=True)
+    return set(contribs)
+
+
+def get_contributions(pofile, potmsgset_ids):
+    """Map all users' most recent contributions to a `POFile`.
 
     Returns a dict mapping `Person` id to the creation time of their most
     recent `TranslationMessage` in `POFile`.
@@ -50,35 +91,36 @@
     a diverged entry in this POFile will nevertheless produce a
     POFileTranslator record.  Fixing that would complicate the work more than
     it is probably worth.
+
+    :param pofile: The `POFile` to find contributions for.
+    :param potmsgset_ids: The ids of the `POTMsgSet`s to look for, as returned
+        by `get_potmsgset_ids`.
     """
     store = IStore(pofile)
-    potmsgset_ids = store.find(
-        TranslationTemplateItem.potmsgsetID,
-        TranslationTemplateItem.potemplateID == pofile.potemplate.id,
-        TranslationTemplateItem.sequence > 0)
+    language_id = pofile.language.id
+    template_id = pofile.potemplate.id
     contribs = store.find(
         (TranslationMessage.submitterID, TranslationMessage.date_created),
         TranslationMessage.potmsgsetID.is_in(potmsgset_ids),
-        TranslationMessage.languageID == pofile.language.id,
+        TranslationMessage.languageID == language_id,
         TranslationMessage.msgstr0 != None,
-        Coalesce(
-            TranslationMessage.potemplateID,
-            pofile.potemplate.id) == pofile.potemplate.id)
+        Coalesce(TranslationMessage.potemplateID, template_id) ==
+            template_id)
     contribs = contribs.config(distinct=(TranslationMessage.submitterID,))
     contribs = contribs.order_by(
         TranslationMessage.submitterID, Desc(TranslationMessage.date_created))
     return dict(contribs)
 
 
-def get_pofiletranslators(pofile):
-    """Get `POFileTranslator` entries for `pofile`.
+def get_pofiletranslators(pofile_id):
+    """Get `POFileTranslator` entries for a `POFile`.
 
     Returns a dict mapping each contributor's person id to their
     `POFileTranslator` record.
     """
-    store = IStore(pofile)
+    store = IStore(POFileTranslator)
     pofts = store.find(
-        POFileTranslator, POFileTranslator.pofileID == pofile.id)
+        POFileTranslator, POFileTranslator.pofileID == pofile_id)
     return dict((poft.personID, poft) for poft in pofts)
 
 
@@ -117,17 +159,27 @@
             date_last_touched=contribs[missing_contributor]))
 
 
-def scrub_pofile(logger, pofile):
+def fix_pofile(logger, pofile_id, potmsgset_ids, pofiletranslators):
+    """This `POFile` needs fixing.  Load its data & fix it."""
+    pofile = IStore(POFile).get(POFile, pofile_id)
+    contribs = get_contributions(pofile, potmsgset_ids)
+    remove_unwarranted_pofiletranslators(
+        logger, pofile, pofiletranslators, contribs)
+    create_missing_pofiletranslators(
+        logger, pofile, pofiletranslators, contribs)
+
+
+def scrub_pofile(logger, pofile_id, template_id, language_id):
     """Scrub `POFileTranslator` entries for one `POFile`.
 
     Removes inappropriate entries and adds missing ones.
     """
-    contribs = get_contributions(pofile)
-    pofiletranslators = get_pofiletranslators(pofile)
-    remove_unwarranted_pofiletranslators(
-        logger, pofile, pofiletranslators, contribs)
-    create_missing_pofiletranslators(
-        logger, pofile, pofiletranslators, contribs)
+    pofiletranslators = get_pofiletranslators(pofile_id)
+    potmsgset_ids = get_potmsgset_ids(template_id)
+    contributors = summarize_contributors(
+        template_id, language_id, potmsgset_ids)
+    if set(pofiletranslators) != set(contributors):
+        fix_pofile(logger, pofile_id, potmsgset_ids, pofiletranslators)
 
 
 class ScrubPOFileTranslator(TunableLoop):
@@ -137,24 +189,23 @@
 
     def __init__(self, *args, **kwargs):
         super(ScrubPOFileTranslator, self).__init__(*args, **kwargs)
-        # This does not listify the POFiles; they are batch-fetched on
-        # demand.  So iteration may not be entirely exact, but it
-        # doesn't really need to be.  It avoids loading all those
-        # POFiles into memory when we only need a few per iteration.
-        self.pofiles = get_pofiles()
+        self.pofile_ids = tuple(get_pofile_ids())
         self.next_offset = 0
 
     def __call__(self, chunk_size):
         """See `ITunableLoop`."""
         start_offset = self.next_offset
         self.next_offset = start_offset + int(chunk_size)
-        batch = list(self.pofiles[start_offset:self.next_offset])
+        batch = self.pofile_ids[start_offset:self.next_offset]
         if len(batch) == 0:
             self.next_offset = None
-        else:
-            for pofile in batch:
-                scrub_pofile(self.log, pofile)
-            transaction.commit()
+            return
+
+        pofile_details = get_pofile_details(batch)
+        for pofile_id in batch:
+            template_id, language_id = pofile_details[pofile_id]
+            scrub_pofile(self.log, pofile_id, template_id, language_id)
+        transaction.commit()
 
     def isDone(self):
         """See `ITunableLoop`."""

=== modified file 'lib/lp/translations/scripts/tests/test_scrub_pofiletranslator.py'
--- lib/lp/translations/scripts/tests/test_scrub_pofiletranslator.py	2012-05-09 06:48:26 +0000
+++ lib/lp/translations/scripts/tests/test_scrub_pofiletranslator.py	2012-05-09 06:48:28 +0000
@@ -21,10 +21,13 @@
 from lp.translations.model.pofiletranslator import POFileTranslator
 from lp.translations.scripts.scrub_pofiletranslator import (
     get_contributions,
-    get_pofiles,
+    get_pofile_details,
+    get_pofile_ids,
     get_pofiletranslators,
+    get_potmsgset_ids,
     scrub_pofile,
     ScrubPOFileTranslator,
+    summarize_contributors,
     )
 
 
@@ -77,18 +80,18 @@
         IStore(poft.pofile).add(poft)
         return poft
 
-    def test_get_pofiles_gets_pofiles_for_active_templates(self):
+    def test_get_pofile_ids_gets_pofiles_for_active_templates(self):
         pofile = self.factory.makePOFile()
-        self.assertIn(pofile, get_pofiles())
+        self.assertIn(pofile.id, get_pofile_ids())
 
-    def test_get_pofiles_skips_inactive_templates(self):
+    def test_get_pofile_ids_skips_inactive_templates(self):
         pofile = self.factory.makePOFile()
         pofile.potemplate.iscurrent = False
-        self.assertNotIn(pofile, get_pofiles())
+        self.assertNotIn(pofile.id, get_pofile_ids())
 
-    def test_get_pofiles_clusters_by_template_name(self):
+    def test_get_pofile_ids_clusters_by_template_name(self):
         # POFiles for templates with the same name are bunched together
-        # in the get_pofiles() output.
+        # in the get_pofile_ids() output.
         templates = [
             self.factory.makePOTemplate(name='shared'),
             self.factory.makePOTemplate(name='other'),
@@ -99,12 +102,13 @@
         pofiles = [
             self.factory.makePOFile(potemplate=template)
             for template in templates]
-        ordering = get_pofiles()
-        self.assertEqual(1, size_distance(ordering, pofiles[0], pofiles[-1]))
+        ordering = get_pofile_ids()
+        self.assertEqual(
+            1, size_distance(ordering, pofiles[0].id, pofiles[-1].id))
 
-    def test_get_pofiles_clusters_by_language(self):
+    def test_get_pofile_ids_clusters_by_language(self):
         # POFiles for sharing templates and the same language are
-        # bunched together in the get_pofiles() output.
+        # bunched together in the get_pofile_ids() output.
         templates = [
             self.factory.makePOTemplate(
                 name='shared', distroseries=self.factory.makeDistroSeries())
@@ -119,16 +123,77 @@
                 pofiles.append(
                     self.factory.makePOFile(language, potemplate=template))
 
-        ordering = get_pofiles()
+        ordering = get_pofile_ids()
         for pofiles in pofiles_per_language.values():
             self.assertEqual(
-                1, size_distance(ordering, pofiles[0], pofiles[1]))
+                1, size_distance(ordering, pofiles[0].id, pofiles[1].id))
+
+    def test_get_pofile_details_maps_id_to_template_and_language_ids(self):
+        pofile = self.factory.makePOFile()
+        self.assertEqual(
+            {pofile.id: (pofile.potemplate.id, pofile.language.id)},
+            get_pofile_details([pofile.id]))
+
+    def test_get_potmsgset_ids_returns_potmsgset_ids(self):
+        pofile = self.factory.makePOFile()
+        potmsgset = self.factory.makePOTMsgSet(
+            potemplate=pofile.potemplate, sequence=1)
+        self.assertContentEqual(
+            [potmsgset.id], get_potmsgset_ids(pofile.potemplate.id))
+
+    def test_get_potmsgset_ids_ignores_inactive_messages(self):
+        pofile = self.factory.makePOFile()
+        self.factory.makePOTMsgSet(
+            potemplate=pofile.potemplate, sequence=0)
+        self.assertContentEqual([], get_potmsgset_ids(pofile.potemplate.id))
+
+    def test_summarize_contributors_gets_contributors(self):
+        pofile = self.factory.makePOFile()
+        tm = self.factory.makeSuggestion(pofile=pofile)
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
+        self.assertContentEqual(
+            [tm.submitter.id],
+            summarize_contributors(
+                pofile.potemplate.id, pofile.language.id, potmsgset_ids))
+
+    def test_summarize_contributors_ignores_inactive_potmsgsets(self):
+        pofile = self.factory.makePOFile()
+        potmsgset = self.factory.makePOTMsgSet(
+            potemplate=pofile.potemplate, sequence=0)
+        self.factory.makeSuggestion(pofile=pofile, potmsgset=potmsgset)
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
+        self.assertContentEqual(
+            [],
+            summarize_contributors(
+                pofile.potemplate.id, pofile.language.id, potmsgset_ids))
+
+    def test_summarize_contributors_includes_diverged_msgs_for_template(self):
+        pofile = self.factory.makePOFile()
+        tm = self.factory.makeSuggestion(pofile=pofile)
+        tm.potemplate = pofile.potemplate
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
+        self.assertContentEqual(
+            [tm.submitter.id],
+            summarize_contributors(
+                pofile.potemplate.id, pofile.language.id, potmsgset_ids))
+
+    def test_summarize_contributors_excludes_other_diverged_messages(self):
+        pofile = self.factory.makePOFile()
+        tm = self.factory.makeSuggestion(pofile=pofile)
+        tm.potemplate = self.factory.makePOTemplate()
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
+        self.assertContentEqual(
+            [],
+            summarize_contributors(
+                pofile.potemplate.id, pofile.language.id, potmsgset_ids))
 
     def test_get_contributions_gets_contributions(self):
         pofile = self.factory.makePOFile()
         tm = self.factory.makeSuggestion(pofile=pofile)
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
         self.assertEqual(
-            {tm.submitter.id: tm.date_created}, get_contributions(pofile))
+            {tm.submitter.id: tm.date_created},
+            get_contributions(pofile, potmsgset_ids))
 
     def test_get_contributions_uses_latest_contribution(self):
         pofile = self.factory.makePOFile()
@@ -138,34 +203,39 @@
             pofile=pofile, date_created=yesterday)
         new_tm = self.factory.makeSuggestion(
             translator=old_tm.submitter, pofile=pofile, date_created=today)
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
         self.assertNotEqual(old_tm.date_created, new_tm.date_created)
         self.assertContentEqual(
-            [new_tm.date_created], get_contributions(pofile).values())
+            [new_tm.date_created],
+            get_contributions(pofile, potmsgset_ids).values())
 
     def test_get_contributions_ignores_inactive_potmsgsets(self):
         pofile = self.factory.makePOFile()
         potmsgset = self.factory.makePOTMsgSet(
             potemplate=pofile.potemplate, sequence=0)
         self.factory.makeSuggestion(pofile=pofile, potmsgset=potmsgset)
-        self.assertEqual({}, get_contributions(pofile))
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
+        self.assertEqual({}, get_contributions(pofile, potmsgset_ids))
 
     def test_get_contributions_includes_diverged_messages_for_template(self):
         pofile = self.factory.makePOFile()
         tm = self.factory.makeSuggestion(pofile=pofile)
         tm.potemplate = pofile.potemplate
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
         self.assertContentEqual(
-            [tm.submitter.id], get_contributions(pofile).keys())
+            [tm.submitter.id], get_contributions(pofile, potmsgset_ids))
 
     def test_get_contributions_excludes_other_diverged_messages(self):
         pofile = self.factory.makePOFile()
         tm = self.factory.makeSuggestion(pofile=pofile)
         tm.potemplate = self.factory.makePOTemplate()
-        self.assertEqual({}, get_contributions(pofile))
+        potmsgset_ids = get_potmsgset_ids(pofile.potemplate.id)
+        self.assertEqual({}, get_contributions(pofile, potmsgset_ids))
 
     def test_get_pofiletranslators_gets_pofiletranslators_for_pofile(self):
         pofile = self.factory.makePOFile()
         tm = self.make_message_with_pofiletranslator(pofile)
-        pofts = get_pofiletranslators(pofile)
+        pofts = get_pofiletranslators(pofile.id)
         self.assertContentEqual([tm.submitter.id], pofts.keys())
         poft = pofts[tm.submitter.id]
         self.assertEqual(pofile, poft.pofile)
@@ -175,7 +245,8 @@
         tm = self.make_message_with_pofiletranslator(pofile)
         old_poft = self.query_pofiletranslator(pofile, tm.submitter).one()
 
-        scrub_pofile(fake_logger, pofile)
+        scrub_pofile(
+            fake_logger, pofile.id, pofile.potemplate.id, pofile.language.id)
 
         new_poft = self.query_pofiletranslator(pofile, tm.submitter).one()
         self.assertEqual(old_poft, new_poft)
@@ -186,14 +257,16 @@
         self.becomeDbUser('postgres')
         poft = self.make_pofiletranslator_without_message()
         (pofile, person) = (poft.pofile, poft.person)
-        scrub_pofile(fake_logger, poft.pofile)
+        scrub_pofile(
+            fake_logger, pofile.id, pofile.potemplate.id, pofile.language.id)
         self.assertIsNone(self.query_pofiletranslator(pofile, person).one())
 
     def test_scrub_pofile_adds_missing_entries(self):
         pofile = self.factory.makePOFile()
         tm = self.make_message_without_pofiletranslator(pofile)
 
-        scrub_pofile(fake_logger, pofile)
+        scrub_pofile(
+            fake_logger, pofile.id, pofile.potemplate.id, pofile.language.id)
 
         new_poft = self.query_pofiletranslator(pofile, tm.submitter).one()
         self.assertEqual(tm.submitter, new_poft.person)


Follow ups