← Back to team overview

launchpad-reviewers team mailing list archive

Re: [Merge] ~cjwatson/launchpad:archive-file-history into launchpad:master

 

I found the new _updateByHash structure somewhat difficult to understand, and tried a different approach in https://code.launchpad.net/~wgrant/launchpad/+git/launchpad/+merge/390811. Not nearly landable, but I think the flow is much clearer, and I'd be interested in your thoughts.

Diff comments:

> diff --git a/lib/lp/archivepublisher/publishing.py b/lib/lp/archivepublisher/publishing.py
> index 5908ccc..4735b89 100644
> --- a/lib/lp/archivepublisher/publishing.py
> +++ b/lib/lp/archivepublisher/publishing.py
> @@ -1084,33 +1083,54 @@ class Publisher(object):
>              real_path = os.path.join(suite_dir, real_name)
>              current_files[path] = (
>                  int(current_entry["size"]), current_entry["sha256"], real_path)
> +
> +        # Gather information on entries currently in the database.  Ensure
> +        # that we know about all the relevant by-hash directory trees before
> +        # doing any removals so that we can prune them properly later, and
> +        # work out which condemned files should be reprieved due to the
> +        # paths in question having their previous content again.
> +        reprieved_files = defaultdict(list)
>          uncondemned_files = set()

It took me far too long to work out that uncondemned_files has been completely repurposed: it previously contained the files that were to be uncondemned (and so were to have their condemnation rescinded), but now is those files that are *currently* uncondemned (and so may render the rescinding of another file's condemnation, now in the set called reprieved_files, unnecessary). This explains why I got rapidly confused by the changes when I looked at this a few times over the last couple of years!

>          for db_file in archive_file_set.getByArchive(
> -                self.archive, container=container, only_condemned=True,
> -                eager_load=True):
> -            stripped_path = strip_dists(db_file.path)
> -            if stripped_path in current_files:
> -                current_sha256 = current_files[stripped_path][1]
> -                if db_file.library_file.content.sha256 == current_sha256:
> -                    uncondemned_files.add(db_file)
> -        if uncondemned_files:
> -            for container, path, sha256 in archive_file_set.unscheduleDeletion(
> -                    uncondemned_files):
> +                self.archive, container=container, eager_load=True):
> +            by_hashes.registerChild(os.path.dirname(strip_dists(db_file.path)))
> +            file_key = (db_file.path, db_file.library_file.content.sha256)
> +            if db_file.scheduled_deletion_date is None:
> +                uncondemned_files.add(file_key)
> +            else:
> +                stripped_path = strip_dists(db_file.path)
> +                if stripped_path in current_files:
> +                    current_sha256 = current_files[stripped_path][1]
> +                    if db_file.library_file.content.sha256 == current_sha256:
> +                        reprieved_files[file_key].append(db_file)
> +
> +        # We may already have uncondemned entries with the same path and
> +        # content as condemned entries that we were about to reprieve; if
> +        # so, there's no need to reprieve them.
> +        for file_key in uncondemned_files:
> +            reprieved_files.pop(file_key, None)
> +
> +        # Make sure nothing in the current Release file is condemned.
> +        if reprieved_files:
> +            reprieved_files_flat = set(
> +                chain.from_iterable(reprieved_files.values()))
> +            archive_file_set.unscheduleDeletion(reprieved_files_flat)
> +            for db_file in reprieved_files_flat:
>                  self.log.debug(
>                      "by-hash: Unscheduled %s for %s in %s for deletion" % (
> -                        sha256, path, container))
> +                        db_file.library_file.content.sha256, db_file.path,
> +                        db_file.container))

Is separate reprieval interesting now that it just creates a new row anyway? It makes already complicated code significantly more involved, just to avoid either uploading a file again.

>  
>          # Remove any condemned files from the database whose stay of
>          # execution has elapsed.  We ensure that we know about all the
>          # relevant by-hash directory trees before doing any removals so that
>          # we can prune them properly later.
> -        for db_file in archive_file_set.getByArchive(
> -                self.archive, container=container):
> -            by_hashes.registerChild(os.path.dirname(strip_dists(db_file.path)))
>          for container, path, sha256 in archive_file_set.reap(
>                  self.archive, container=container):
> -            self.log.debug(
> -                "by-hash: Deleted %s for %s in %s" % (sha256, path, container))
> +            if (path, sha256) not in uncondemned_files:
> +                self.log.debug(
> +                    "by-hash: Deleted %s for %s in %s" %
> +                    (sha256, path, container))

This could use a comment to clarify that the "file" in "uncondemned_files" may correspond to multiple ArchiveFiles, so a file may be removed even if it is in uncondemned_files.

>  
>          # Ensure that all files recorded in the database are in by-hash.
>          db_files = archive_file_set.getByArchive(
> diff --git a/lib/lp/soyuz/interfaces/archivefile.py b/lib/lp/soyuz/interfaces/archivefile.py
> index 005a599..e81419f 100644
> --- a/lib/lp/soyuz/interfaces/archivefile.py
> +++ b/lib/lp/soyuz/interfaces/archivefile.py
> @@ -49,6 +49,16 @@ class IArchiveFile(Interface):
>          title=_("The index file in the librarian."),
>          schema=ILibraryFileAlias, required=True, readonly=True)
>  
> +    date_created = Datetime(
> +        title=_("The date when this file was created."),
> +        # XXX cjwatson 2018-04-17: Should be required=True, but we need to
> +        # backfill existing rows first.
> +        required=False, readonly=False)
> +
> +    date_superseded = Datetime(
> +        title=_("The date when this file was scheduled for future deletion."),
> +        required=False, readonly=False)

This seems like the least interesting property of date_superseded. The more interesting one is that it's when this file ceased to be the holder of its path.

> +
>      scheduled_deletion_date = Datetime(
>          title=_("The date when this file should stop being published."),
>          required=False, readonly=False)


-- 
https://code.launchpad.net/~cjwatson/launchpad/+git/launchpad/+merge/373971
Your team Launchpad code reviewers is requested to review the proposed merge of ~cjwatson/launchpad:archive-file-history into launchpad:master.


References