launchpad-reviewers team mailing list archive

Thread
Date
Re: [Merge] lp:~cjwatson/launchpad/refactor-cron-germinate into lp:launchpad

To: Colin Watson <cjwatson@xxxxxxxxxxxxx>
From: "Jeroen T. Vermeulen" <jtv@xxxxxxxxxxxxx>
Date: Wed, 07 Dec 2011 09:22:24 -0000
In-reply-to: <20111206150549.23901.78467.launchpad@ackee.canonical.com>
Reply-to: mp+84624@xxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx
+    def runGerminate(self, override_file, series_name, arch, flavours,
+                     structures):

Rightly or wrongly, the “run” in that name led me to expect that this method would fire up a separate process.  Maybe a very short docstring to dispel that notion?


+        germinator = Germinator(arch)
+
+        # Read archive metadata.
+        archive = TagFile(
+            series_name, self.components, arch,
+            'file://%s' % self.config.archiveroot, cleanup=True)
+        germinator.parse_archive(archive)
+
+        for flavour in flavours:

To me, your branch is of very high quality but this is its Achilles' heel.  The following loop body is massive!  I can't see it all on one page, so it becomes hard even to see just how long it is.  If I judged the indentations right, it makes up the entire rest of the method.

Please extract it.  It's well nigh impossible to see (now or in the future, after some maintenance) whether any variables are carried across loop iterations.  That sort of thing can easily trip you up, especially when it happens by accident.

Once you've extracted the loop body, given its size, it's probably still worth extracting chunks from that.

According to some, the pattern of “comment, block of code, blank line, comment, block of code, blank line” is a strong indication that you need to split things up.  It shows that you have already done so in your own mind, but leaves each individual reader to figure out what the dependencies between consecutive blocks are.

Also, of course, small methods with simple purposes are easier to specify in detail and cheaper to test thoroughly.


+            self.logger.info('Germinating for %s/%s/%s',
+                             flavour, series_name, arch)

We follow a different rule for line-breaking function calls than we do for function definitions.

For calls, line-break right after the opening parenthesis and then indent the arguments list by 4 spaces:

    self.logger.info(
        "Germinating for %s/%s/%s.", flavour, series_name, arch)


+            # Add this to the germinate log as well so that that can be
+            # debugged more easily.  Log a separator line first.
+            self.germinate_logger.info('', extra={'progress': True})
+            self.germinate_logger.info('Germinating for %s/%s/%s',
+                                       flavour, series_name, arch,
+                                       extra={'progress': True})

What does the “extra” parameter do?


+            # Expand dependencies.
+            structure = structures[flavour]
+            germinator.plant_seeds(structure)
+            germinator.grow(structure)
+            germinator.add_extras(structure)
+            # Write output files.
+
+            # The structure file makes it possible to figure out how the
+            # other output files relate to each other.
+            structure.write(self.outputPath(
+                flavour, series_name, arch, 'structure'))
+
+            # "all" and "all.sources" list the full set of binary and source
+            # packages respectively for a given flavour/suite/architecture
+            # combination.
+            all_path = self.outputPath(flavour, series_name, arch, 'all')
+            all_sources_path = self.outputPath(
+                flavour, series_name, arch, 'all.sources')
+            germinator.write_all_list(structure, all_path)
+            germinator.write_all_source_list(structure, all_sources_path)
+
+            # Write the dependency-expanded output for each seed.  Several
+            # of these are used by archive administration tools, and others
+            # are useful for debugging, so it's best to just write them all.
+            for seedname in structure.names:
+                germinator.write_full_list(
+                    structure,
+                    self.outputPath(flavour, series_name, arch, seedname),
+                    seedname)
+
+            def writeOverrides(seedname, key, value):
+                packages = germinator.get_full(structure, seedname)
+                for package in sorted(packages):
+                    print >>override_file, '%s/%s  %s  %s' % (
+                        package, arch, key, value)

For functions we use PEP8 naming: lower_case_with_underscores.


+            # Generate apt-ftparchive "extra overrides" for Task fields.
+            for seedname in structure.names:
+                if seedname == 'extra':
+                    continue

This nested loop is too long as well.  Be especially careful with “break” and “continue” in combination with further nesting: those are particularly sensitive to maintenance mistakes.  Famously brought down AT&T's worldwide phone network once.

The break & continue keywords aren't inherently evil, but surprisingly often they're a sign that your code could usefully be expressed more clearly.  They make it easy to pile up alternate code paths, which the reader must all keep in mind in order to debug or change the code effectively.

The only reason you need this particular “continue” is to filter an item out of structure.names.  Why not do that before you go into this loop?  Yes, that means an extra iteration, but its costs are negligible:

    seednames = [name for name in structure.names if name != 'extra']
    for seedname in seednames:
        # ...

This reduces the number of code paths through the loop that the reader has to worry about.  It also reduces the complexity of the loop body.


+                task_headers = {}
+                with structure[seedname] as seedtext:
+                    for line in seedtext:
+                        if line.lower().startswith('task-') and ':' in line:
+                            key, value = line.split(':', 1)
+                            # e.g. "Task-Name" => "name"
+                            key = key[5:].lower()
+                            task_headers[key] = value.strip()

That's a lot of string manipulation.  It may be clearer as a regex:

    task_header_regex = re.compile("task-([^:]*):(.*)", flags=re.IGNORECASE)
    for line in seedtext:
        match = task_header_regex.match(line)
        if match is not None:
            key, value = match.groups()
            task_headers[key.lower()] = value.strip()


+                if not task_headers:
+                    continue
+
+                # Work out the name of the Task to be generated from this
+                # seed.  If there is a Task-Name header, it wins; otherwise,
+                # seeds with a Task-Per-Derivative header are honoured for
+                # all flavours and put in an appropriate namespace, while
+                # other seeds are only honoured for the first flavour and
+                # have archive-global names.
+                if 'name' in task_headers:
+                    task = task_headers['name']
+                elif 'per-derivative' in task_headers:
+                    task = '%s-%s' % (flavour, seedname)
+                elif flavour == flavours[0]:
+                    task = seedname
+                else:
+                    continue

This “continue” is asymmetric with its alternative code paths, making it harder to keep track of control flow.  Consider extracting the whole computation of “task” into a function that returns None as the last resort.  If that works out, your inner loop's body may only need a single alternate code path.


+                # The list of packages in this task come from this seed plus
+                # any other seeds listed in a Task-Seeds header.
+                scan_seeds = set([seedname])
+                if 'seeds' in task_headers:
+                    scan_seeds.update(task_headers['seeds'].split())
+                for scan_seed in sorted(scan_seeds):
+                    writeOverrides(scan_seed, 'Task', task)

This looks to be the only part that needs to be conditional on the computation of “task.”  If so, you don't even need “continue”; just put these last 5 lines in a simple “if”!


+            # Generate apt-ftparchive "extra overrides" for Build-Essential
+            # fields.
+            if ('build-essential' in structure.names and
+                flavour == flavours[0]):
+                writeOverrides('build-essential', 'Build-Essential', 'yes')

If flavours[0] has a special meaning, that's worth documenting.  Consider documenting it in a docstring, or even making it a separate parameter.

I think for now flavours[0] is always “ubuntu,” but will we want to generalize this code for derived distros later?

That concludes my first pass over this method.  Now that I'm a bit more familiar with the details, I'll have to go over it again to review its actual functionality.
-- 
https://code.launchpad.net/~cjwatson/launchpad/refactor-cron-germinate/+merge/84624
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~cjwatson/launchpad/refactor-cron-germinate into lp:launchpad.
Follow ups

Re: [Merge] lp:~cjwatson/launchpad/refactor-cron-germinate into lp:launchpad
From: Colin Watson, 2011-12-07
References

[Merge] lp:~cjwatson/launchpad/refactor-cron-germinate into lp:launchpad
From: Colin Watson, 2011-12-06