
launchpad-reviewers team mailing list archive

Re: [Merge] lp:~cjwatson/launchpad/archive-index-by-hash into lp:launchpad

 


Diff comments:

> 
> === modified file 'lib/lp/archivepublisher/publishing.py'
> --- lib/lp/archivepublisher/publishing.py	2016-03-11 11:45:56 +0000
> +++ lib/lp/archivepublisher/publishing.py	2016-03-17 14:51:01 +0000
> @@ -231,6 +243,93 @@
>          return max(len(str(item['size'])) for item in self[key])
>  
>  
> +class ByHash:
> +    """Represents a single by-hash directory tree."""
> +
> +    # Subdirectory names expected by apt.
> +    supported_hashes = ("MD5Sum", "SHA1", "SHA256")

Yes.  I've refactored this by using a new IArchiveHash interface and implementations, where the various properties of hash algorithms now live.  There are probably some other bits of archivepublisher that could be folded into this in future as well.
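
Roughly the shape I have in mind is something like this; the attribute names here are only illustrative, not necessarily the ones in the branch:

    # Illustrative sketch only: attribute names are hypothetical, not
    # necessarily those used in the branch.
    import hashlib

    from zope.interface import Attribute, Interface, implementer


    class IArchiveHash(Interface):
        """Properties of a hash algorithm as used by the archive publisher."""

        apt_name = Attribute("Name of this hash's by-hash subdirectory.")
        lfc_name = Attribute(
            "Name of the LibraryFileContent attribute holding this digest.")
        hash_factory = Attribute("Callable returning a new hash object.")


    @implementer(IArchiveHash)
    class SHA256Hash:
        apt_name = "SHA256"
        lfc_name = "sha256"
        hash_factory = hashlib.sha256

With something like that in place, getHashFromLFA collapses to a simple getattr(lfa.content, archive_hash.lfc_name).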

> +
> +    def __init__(self, root, key):
> +        self.root = root
> +        self.path = os.path.join(root, key, "by-hash")
> +        self.known_digests = defaultdict(set)
> +
> +    @staticmethod
> +    def getHashFromLFA(lfa, name):
> +        attr = {
> +            "MD5Sum": "md5",
> +            "SHA1": "sha1",
> +            "SHA256": "sha256",
> +            }[name]
> +        return getattr(lfa.content, attr)
> +
> +    def add(self, lfa, copy_from_path=None):
> +        """Ensure that by-hash entries for a single file exist.
> +
> +        :param lfa: The `ILibraryFileAlias` to add.
> +        :param copy_from_path: If not None, copy file content from here
> +            rather than fetching it from the librarian.  This can be used
> +            for newly-added files to avoid needing to commit the transaction
> +            before calling this method.
> +        """
> +        for hashname in self.supported_hashes:
> +            digest = self.getHashFromLFA(lfa, hashname)
> +            digest_path = os.path.join(self.path, hashname, digest)
> +            self.known_digests[hashname].add(digest)
> +            if not os.path.exists(digest_path):
> +                with open_for_writing(digest_path, "wb") as outfile:
> +                    if copy_from_path is not None:
> +                        infile = open(
> +                            os.path.join(self.root, copy_from_path), "rb")
> +                    else:
> +                        lfa.open()
> +                        infile = lfa
> +                    try:
> +                        shutil.copyfileobj(infile, outfile, 4 * 1024 * 1024)
> +                    finally:
> +                        infile.close()

For any large and active suite, the majority of entries are going to exist only in by-hash anyway.  But it's true that the bulk of infrequently-updated PPAs would end up having their dists size doubled by this, at least pre-diskless-archives.

I definitely don't think we should make the by-name path be a symlink, and making any by-hash path a symlink would make the by-hash publication process considerably more delicate, since it would have to convert symlinks into real files.  I had steered clear of hardlinks to avoid potential problems with something overwriting the by-name path non-atomically and thus modifying the by-hash file, but I've just checked that both apt-ftparchive and NMAF do the right thing here, so it should actually be OK.  I've changed this to create a hardlink in the copy_from_path case now.
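
In outline the new behaviour is something like the following standalone sketch (not the exact code in the branch; directory creation and the open_for_writing helper are elided):

    import os
    import shutil


    def add_by_hash_entry(root, digest_path, lfa, copy_from_path=None):
        """Sketch: materialise one by-hash entry, hard-linking when possible."""
        if os.path.exists(digest_path):
            return
        if copy_from_path is not None:
            # Hard-link the newly-written by-name file into by-hash.  This is
            # safe because apt-ftparchive and NMAF replace by-name paths
            # atomically rather than rewriting them in place, so the linked
            # by-hash file is never modified after creation.
            os.link(os.path.join(root, copy_from_path), digest_path)
        else:
            # Older entries still have their content fetched from the librarian.
            lfa.open()
            try:
                with open(digest_path, "wb") as outfile:
                    shutil.copyfileobj(lfa, outfile, 4 * 1024 * 1024)
            finally:
                lfa.close()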

> +
> +    def exists(self, hashname, digest):
> +        """Do we know about a file with this digest?"""
> +        return digest in self.known_digests[hashname]
> +
> +    def prune(self):
> +        """Remove all by-hash entries that we have not been told to add."""
> +        if any(self.known_digests.values()):
> +            for hashname in self.supported_hashes:
> +                hash_path = os.path.join(self.path, hashname)
> +                if os.path.exists(hash_path):
> +                    for digest in list(os.listdir(hash_path)):
> +                        if not self.exists(hashname, digest):
> +                            os.unlink(os.path.join(hash_path, digest))
> +        elif os.path.exists(self.path):
> +            shutil.rmtree(self.path)

It's probably safe, but I've reimplemented this a bit more conservatively just in case.
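
Roughly, the reworked version only ever unlinks individual entries it doesn't know about, and removes directories only once they're actually empty, rather than rmtree'ing anything.  A sketch (details may differ from the branch):

    def prune(self):
        """Remove by-hash entries that we have not been told to keep.

        Sketch of the more conservative approach: unlink unknown digests
        one at a time and remove directories only once they are empty,
        rather than ever calling shutil.rmtree() on the whole tree.
        """
        for hashname in self.supported_hashes:
            hash_path = os.path.join(self.path, hashname)
            if not os.path.isdir(hash_path):
                continue
            for digest in sorted(os.listdir(hash_path)):
                if digest not in self.known_digests[hashname]:
                    os.unlink(os.path.join(hash_path, digest))
            if not os.listdir(hash_path):
                os.rmdir(hash_path)
        if os.path.isdir(self.path) and not os.listdir(self.path):
            os.rmdir(self.path)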

> +
> +
> +class ByHashes:
> +    """Represents all by-hash directory trees in an archive."""
> +
> +    def __init__(self, root):
> +        self.root = root
> +        self.children = {}
> +
> +    def getChild(self, path):
> +        key = os.path.dirname(path)
> +        if key not in self.children:
> +            self.children[key] = ByHash(self.root, key)
> +        return self.children[key]
> +
> +    def add(self, path, lfa, copy_from_path=None):
> +        self.getChild(path).add(lfa, copy_from_path=copy_from_path)
> +
> +    def exists(self, path, hashname, digest):
> +        return self.getChild(path).exists(hashname, digest)
> +
> +    def prune(self):
> +        for child in self.children.values():
> +            child.prune()
> +
> +
>  class Publisher(object):
>      """Publisher is the class used to provide the facility to publish
>      files in the pool of a Distribution. The publisher objects will be
> @@ -501,7 +600,18 @@
>              *conditions).config(distinct=True).order_by(
>                  DistroSeries.id, BinaryPackagePublishingHistory.pocket)
>  
> -        for distroseries, pocket in chain(source_suites, binary_suites):
> +        archive_file_suites = []
> +        for container in getUtility(IArchiveFileSet).getContainersToReap(
> +                self.archive, container_prefix=u"release:"):

I don't believe the latter comment to be true.  This is in step A2, whose purpose is to mark additional suites dirty; it is also the step that handles uncondemned deletions.  But the former point is definitely true, so I've moved this query to D_writeReleaseFiles instead.
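
In other words, D_writeReleaseFiles now works out the extra suites itself, along these lines (a sketch only; dirty_suites and _writeSuite are stand-ins for however that method actually assembles and writes its suite list):

    # Sketch of the relocated query in D_writeReleaseFiles; `dirty_suites`
    # is a hypothetical stand-in for the existing dirty-suite computation.
    release_file_suites = set(dirty_suites)
    for container in getUtility(IArchiveFileSet).getContainersToReap(
            self.archive, container_prefix=u"release:"):
        try:
            release_file_suites.add(self.distro.getDistroSeriesAndPocket(
                container[len(u"release:"):]))
        except NotFoundError:
            pass
    for distroseries, pocket in release_file_suites:
        self._writeSuite(distroseries, pocket)  # hypothetical call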

> +            try:
> +                distroseries, pocket = self.distro.getDistroSeriesAndPocket(
> +                    container[len(u"release:"):])
> +                archive_file_suites.append((distroseries, pocket))
> +            except NotFoundError:
> +                pass
> +
> +        for distroseries, pocket in chain(
> +                source_suites, binary_suites, archive_file_suites):
>              if self.isDirty(distroseries, pocket):
>                  continue
>              if (cannot_modify_suite(self.archive, distroseries, pocket)
> @@ -796,6 +906,69 @@
>              return self.distro.displayname
>          return "LP-PPA-%s" % get_ppa_reference(self.archive)
>  
> +    def _updateByHash(self, suite, release_data):
> +        """Update by-hash files for a suite."""

I somewhat disagree with your characterisation of this method.  It's true that it derives some database state from the output of apt-ftparchive or NMAF (as appropriate) by injecting the generated index files, but that in turn ultimately comes from the database.  Beyond that, this method does not derive any database state from the filesystem; in particular, the database is the authoritative record of when by-hash files should be pruned, and the main purpose of this method is to bring the filesystem into sync with that.  But perhaps this is a distinction without a difference, and I've updated the documentation anyway.

It is true that, to the best of my knowledge, there is no spec, although there's a Debian bug asking for one.  I worked out most of this by code inspection and by comparing with the output of apt-ftparchive in its new by-hash generation mode (which is too simple-minded to be suitable for use in Launchpad, but is useful for comparison).
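
For anyone following along without the apt source to hand, the client-side behaviour being matched is roughly this (illustrative only, not code from the branch):

    import os


    def by_hash_fetch_path(suite_dir, entry):
        """Sketch: where a by-hash-aware client looks for a Release entry.

        `entry` is one item from the Release file's SHA256 list, e.g.
        {"name": "main/binary-amd64/Packages.gz", "sha256": "...", "size": 1234}.
        """
        return os.path.join(
            suite_dir, os.path.dirname(entry["name"]),
            "by-hash", "SHA256", entry["sha256"])

So e.g. main/binary-amd64/Packages.gz under dists/xenial ends up being fetched as dists/xenial/main/binary-amd64/by-hash/SHA256/<digest>, which is why there's one by-hash tree per index directory rather than one per suite.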

> +        archive_file_set = getUtility(IArchiveFileSet)
> +        by_hashes = ByHashes(self._config.archiveroot)
> +        suite_dir = os.path.relpath(
> +            os.path.join(self._config.distsroot, suite),
> +            self._config.archiveroot)
> +        container = "release:%s" % suite
> +
> +        # Remove any condemned files from the database.  We ensure that we
> +        # know about all the relevant by-hash directory trees before doing
> +        # any removals so that we can prune them properly later.
> +        for archive_file in archive_file_set.getByArchive(
> +                self.archive, container=container):
> +            by_hashes.getChild(archive_file.path)
> +        archive_file_set.reap(self.archive, container=container)
> +
> +        # Gather information.
> +        archive_files = archive_file_set.getByArchive(
> +            self.archive, container=container, eager_load=True)
> +        active_files = {}
> +        for active_entry in release_data["SHA256"]:
> +            path = os.path.join(suite_dir, active_entry["name"])
> +            active_files[path] = (active_entry["size"], active_entry["sha256"])
> +
> +        # Ensure that all files recorded in the database are in by-hash.
> +        current_files = {}
> +        for archive_file in archive_files:
> +            by_hashes.add(archive_file.path, archive_file.library_file)
> +            if archive_file.scheduled_deletion_date is None:
> +                current_files[archive_file.path] = archive_file
> +
> +        # Supersede any database records that do not correspond to active
> +        # index files.
> +        superseded_files = set()
> +        for archive_file in archive_files:
> +            path = archive_file.path
> +            if (path not in active_files or
> +                not by_hashes.exists(
> +                    path, "SHA256", active_files[path][1])):
> +                superseded_files.add(archive_file)

That's not an on-disk test.  The purpose of {ByHash,ByHashes}.exists is to test that the file in question is one of those that has been added with .add; removing this test would be catastrophic.  Your comment makes it clear that this method is badly named, though, so I've renamed it.
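
To make the intent explicit, it now reads more along the lines of (new name illustrative):

    def known(self, hashname, digest):
        """Is this digest one that add() has already been asked to keep?

        This is purely an in-memory check against self.known_digests; it
        never looks at the filesystem.
        """
        return digest in self.known_digests[hashname]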

> +        archive_file_set.scheduleDeletion(
> +            superseded_files, timedelta(days=BY_HASH_STAY_OF_EXECUTION))

I think you were led to this conclusion by misunderstanding the purpose of ByHashes.exists, as above.  Any ArchiveFile row whose stay of execution has not elapsed is added to the set of known digests earlier, and so will not be added to superseded_files (now condemned_files).

> +
> +        # Ensure that all the active index files are in by-hash and have
> +        # corresponding database entries.
> +        # XXX cjwatson 2016-03-15: This should possibly use bulk creation,
> +        # although we can only avoid about a third of the queries since the
> +        # librarian client has no bulk upload methods.
> +        for path, (size, sha256) in active_files.items():
> +            full_path = os.path.join(self._config.archiveroot, path)
> +            if (os.path.exists(full_path) and
> +                    not by_hashes.exists(path, "SHA256", sha256)):

Again, this seems to be mostly based on a misunderstanding of ByHashes.exists, hopefully rectified by renaming it.  This does not check whether a by-hash file is on disk; it checks whether it is known to ByHashes, i.e. whether it has been added earlier in this method.

We do check that the by-name path exists on disk, but that's because it may be an uncompressed file that is listed in the Release data but not present on disk, and so can't be opened for the purpose of adding it to the librarian and by-hash.

> +                archive_file = archive_file_set.newFromFile(
> +                    self.archive, container, self._config.archiveroot, path,
> +                    size, filenameToContentType(path))
> +                by_hashes.add(
> +                    path, archive_file.library_file, copy_from_path=path)
> +
> +        # Finally, remove any files from disk that aren't recorded in the
> +        # database and aren't active.
> +        by_hashes.prune()
> +
>      def _writeReleaseFile(self, suite, release_data):
>          """Write a Release file to the archive.
>  


-- 
https://code.launchpad.net/~cjwatson/launchpad/archive-index-by-hash/+merge/289379
Your team Launchpad code reviewers is subscribed to branch lp:launchpad.

