launchpad-reviewers team mailing list archive

Thread
Date
Re: [Merge] lp:~cjwatson/launchpad/archive-index-by-hash into lp:launchpad

To: Colin Watson <cjwatson@xxxxxxxxxxxxx>
From: William Grant <me@xxxxxxxxxxxxxxxxxx>
Date: Fri, 18 Mar 2016 09:44:40 -0000
In-reply-to: <20160317145131.22503.90432.launchpad@ackee.canonical.com>
Reply-to: mp+289379@xxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx
Review: Needs Fixing code



Diff comments:

> 
> === modified file 'lib/lp/archivepublisher/publishing.py'
> --- lib/lp/archivepublisher/publishing.py	2016-03-11 11:45:56 +0000
> +++ lib/lp/archivepublisher/publishing.py	2016-03-17 14:51:01 +0000
> @@ -231,6 +243,93 @@
>          return max(len(str(item['size'])) for item in self[key])
>  
>  
> +class ByHash:
> +    """Represents a single by-hash directory tree."""
> +
> +    # Subdirectory names expected by apt.
> +    supported_hashes = ("MD5Sum", "SHA1", "SHA256")

I guess this has to be synchronised with the set of hash algorithms in _writeSuite?

> +
> +    def __init__(self, root, key):
> +        self.root = root
> +        self.path = os.path.join(root, key, "by-hash")
> +        self.known_digests = defaultdict(set)
> +
> +    @staticmethod
> +    def getHashFromLFA(lfa, name):
> +        attr = {
> +            "MD5Sum": "md5",
> +            "SHA1": "sha1",
> +            "SHA256": "sha256",
> +            }[name]
> +        return getattr(lfa.content, attr)
> +
> +    def add(self, lfa, copy_from_path=None):
> +        """Ensure that by-hash entries for a single file exist.
> +
> +        :param lfa: The `ILibraryFileAlias` to add.
> +        :param copy_from_path: If not None, copy file content from here
> +            rather than fetching it from the librarian.  This can be used
> +            for newly-added files to avoid needing to commit the transaction
> +            before calling this method.
> +        """
> +        for hashname in self.supported_hashes:
> +            digest = self.getHashFromLFA(lfa, hashname)
> +            digest_path = os.path.join(self.path, hashname, digest)
> +            self.known_digests[hashname].add(digest)
> +            if not os.path.exists(digest_path):
> +                with open_for_writing(digest_path, "wb") as outfile:
> +                    if copy_from_path is not None:
> +                        infile = open(
> +                            os.path.join(self.root, copy_from_path), "rb")
> +                    else:
> +                        lfa.open()
> +                        infile = lfa
> +                    try:
> +                        shutil.copyfileobj(infile, outfile, 4 * 1024 * 1024)
> +                    finally:
> +                        infile.close()

This is a lot of quadruplicated space and time.

Can we just symlink or hardlink these in one direction or another? Ideally one would just symlink everything, including the by-name path, to the real file in the best hash dir, but perhaps we should leave the by-name path as a real hardlinked file for compatibility and safety.

> +
> +    def exists(self, hashname, digest):
> +        """Do we know about a file with this digest?"""
> +        return digest in self.known_digests[hashname]
> +
> +    def prune(self):
> +        """Remove all by-hash entries that we have not been told to add."""

[...] removing the by-hash directory itself if it ends up empty.

> +        if any(self.known_digests.values()):
> +            for hashname in self.supported_hashes:
> +                hash_path = os.path.join(self.path, hashname)
> +                if os.path.exists(hash_path):
> +                    for digest in list(os.listdir(hash_path)):
> +                        if not self.exists(hashname, digest):
> +                            os.unlink(os.path.join(hash_path, digest))
> +        elif os.path.exists(self.path):
> +            shutil.rmtree(self.path)

Well that's terrifying, but justifiable I suppose.

> +
> +
> +class ByHashes:
> +    """Represents all by-hash directory trees in an archive."""
> +
> +    def __init__(self, root):
> +        self.root = root
> +        self.children = {}
> +
> +    def getChild(self, path):
> +        key = os.path.dirname(path)
> +        if key not in self.children:
> +            self.children[key] = ByHash(self.root, key)
> +        return self.children[key]

It's quite unobvious that no-op getChilds have side effects vital to the pruning process.

> +
> +    def add(self, path, lfa, copy_from_path=None):
> +        self.getChild(path).add(lfa, copy_from_path=copy_from_path)
> +
> +    def exists(self, path, hashname, digest):
> +        return self.getChild(path).exists(hashname, digest)
> +
> +    def prune(self):
> +        for child in self.children.values():
> +            child.prune()
> +
> +
>  class Publisher(object):
>      """Publisher is the class used to provide the facility to publish
>      files in the pool of a Distribution. The publisher objects will be
> @@ -501,7 +600,18 @@
>              *conditions).config(distinct=True).order_by(
>                  DistroSeries.id, BinaryPackagePublishingHistory.pocket)
>  
> -        for distroseries, pocket in chain(source_suites, binary_suites):
> +        archive_file_suites = []
> +        for container in getUtility(IArchiveFileSet).getContainersToReap(
> +                self.archive, container_prefix=u"release:"):

This will force index regeneration for suites that have reapable by-hash files. https://bugs.launchpad.net/launchpad/+bug/685033 is a similar case with uncondemned deletions that we ran into with NBS binaries in various suites a few weeks ago.

Additionally, archives that aren't otherwise dirty don't get this far: old indices won't be pruned from an archive unless the archive has fresh publications or uncondemned deletions after the stay of execution elapses.

> +            try:
> +                distroseries, pocket = self.distro.getDistroSeriesAndPocket(
> +                    container[len(u"release:"):])
> +                archive_file_suites.append((distroseries, pocket))
> +            except NotFoundError:
> +                pass

Should this crash or at least log an error? It shouldn't happen unless series are deleted...

> +
> +        for distroseries, pocket in chain(
> +                source_suites, binary_suites, archive_file_suites):
>              if self.isDirty(distroseries, pocket):
>                  continue
>              if (cannot_modify_suite(self.archive, distroseries, pocket)
> @@ -796,6 +906,69 @@
>              return self.distro.displayname
>          return "LP-PPA-%s" % get_ppa_reference(self.archive)
>  
> +    def _updateByHash(self, suite, release_data):
> +        """Update by-hash files for a suite."""

This function is special in that it derives database state from the filesystem, despite being part of a component whose raison d'être is to derive filesystem state from the database. The docstring should mention that it consumes a Release file prototype and all on-disk files referenced by that prototype, then updates the DB ArchiveFiles and the on-disk by-hash directories, including pruning anything that should no longer be present.

It could also be the first place on the Internet that actually documents the by-hash structure. I think the "by-hash" directory is always a sibling of the by-name file, but I can't find a spec.

> +        archive_file_set = getUtility(IArchiveFileSet)
> +        by_hashes = ByHashes(self._config.archiveroot)
> +        suite_dir = os.path.relpath(
> +            os.path.join(self._config.distsroot, suite),
> +            self._config.archiveroot)
> +        container = "release:%s" % suite
> +
> +        # Remove any condemned files from the database.  We ensure that we

Not any condemned files -- just those whose stay of execution has elapsed.

> +        # know about all the relevant by-hash directory trees before doing
> +        # any removals so that we can prune them properly later.
> +        for archive_file in archive_file_set.getByArchive(
> +                self.archive, container=container):
> +            by_hashes.getChild(archive_file.path)
> +        archive_file_set.reap(self.archive, container=container)
> +
> +        # Gather information.

It is clear that information is gathered, but it's not entirely obvious what or why.

> +        archive_files = archive_file_set.getByArchive(
> +            self.archive, container=container, eager_load=True)

archive_files isn't used until the next block, and its confusing similarity to active_files as used in this block is confusing.

> +        active_files = {}
> +        for active_entry in release_data["SHA256"]:
> +            path = os.path.join(suite_dir, active_entry["name"])
> +            active_files[path] = (active_entry["size"], active_entry["sha256"])

I'd be tempted to rename active_files to current_files, since it sounds like it could be anything that's still alive. But it'll be less confusing once archive_files is renamed as I suggest below...

> +
> +        # Ensure that all files recorded in the database are in by-hash.
> +        current_files = {}
> +        for archive_file in archive_files:
> +            by_hashes.add(archive_file.path, archive_file.library_file)
> +            if archive_file.scheduled_deletion_date is None:
> +                current_files[archive_file.path] = archive_file

Worth calling archive_files something like db_files?

Also, current_files is seemingly never read? I assume its intent was to only add missing files if they were still referenced, not just hanging around, but that behaviour was eventually dropped. I think it's fine and a good simplification to just write out anything uncondemned, as you do. So this vestigial piece can probably be dropped as well.

> +
> +        # Supersede any database records that do not correspond to active
> +        # index files.

"Supersede" and "condemn" may not be usefully distinct here, but in the closely related package publication context, from which this code inherits the terms, their meanings differ significantly. eg. condemnation of a superseded source may be withheld until its binaries are condemned. I'd be consistent with the existing usage in this module and say "condemn" rather than "supersede".

> +        superseded_files = set()
> +        for archive_file in archive_files:
> +            path = archive_file.path
> +            if (path not in active_files or
> +                not by_hashes.exists(
> +                    path, "SHA256", active_files[path][1])):
> +                superseded_files.add(archive_file)

Only in specific rare circumstances does the publisher deliberately cope with an on-disk archive not having the expected set of files. I can't anticipate any scenario in which "not by_hashes.exists" is useful, so we should probably not have it.

> +        archive_file_set.scheduleDeletion(
> +            superseded_files, timedelta(days=BY_HASH_STAY_OF_EXECUTION))

This will, I think, cause no by-hash file to ever be deleted. Once condemned, the file will show up in archive_files again in the next run, and have its stay of execution pushed back.

> +
> +        # Ensure that all the active index files are in by-hash and have
> +        # corresponding database entries.
> +        # XXX cjwatson 2016-03-15: This should possibly use bulk creation,
> +        # although we can only avoid about a third of the queries since the
> +        # librarian client has no bulk upload methods.
> +        for path, (size, sha256) in active_files.items():
> +            full_path = os.path.join(self._config.archiveroot, path)
> +            if (os.path.exists(full_path) and
> +                    not by_hashes.exists(path, "SHA256", sha256)):

This is racy. Pieces of content (think e7ab72b8f37c7c9c9f6386fb8e3dfa40bf6fe4b67876703c5927e47cb8664ce4) may appear soon after they disappear, and this just checks whether a file is on disk rather than whether it will be in 9 lines or 24 hours.

For example, if a Packages file publishes empty, publishes non-empty, then quickly publishes empty again, the second empty publish will notice that the by-hash file is already there and not reregister it in the database. 24 hours after the non-empty publish, the by-hash file shared by both empty publishes will be deleted, and apt will break.

> +                archive_file = archive_file_set.newFromFile(
> +                    self.archive, container, self._config.archiveroot, path,
> +                    size, filenameToContentType(path))
> +                by_hashes.add(
> +                    path, archive_file.library_file, copy_from_path=path)
> +
> +        # Finally, remove any files from disk that aren't recorded in the
> +        # database and aren't active.
> +        by_hashes.prune()
> +
>      def _writeReleaseFile(self, suite, release_data):
>          """Write a Release file to the archive.
>  


-- 
https://code.launchpad.net/~cjwatson/launchpad/archive-index-by-hash/+merge/289379
Your team Launchpad code reviewers is subscribed to branch lp:launchpad.
References

[Merge] lp:~cjwatson/launchpad/archive-index-by-hash into lp:launchpad
From: Colin Watson, 2016-03-17