launchpad-reviewers team mailing list archive
-
launchpad-reviewers team
-
Mailing list archive
-
Message #25296
Re: [Merge] ~cjwatson/launchpad:archive-file-history into launchpad:master
I found the new _updateByHash structure somewhat difficult to understand, and tried a different approach in https://code.launchpad.net/~wgrant/launchpad/+git/launchpad/+merge/390811. Not nearly landable, but I think the flow is much clearer, and I'd be interested in your thoughts.
Diff comments:
> diff --git a/lib/lp/archivepublisher/publishing.py b/lib/lp/archivepublisher/publishing.py
> index 5908ccc..4735b89 100644
> --- a/lib/lp/archivepublisher/publishing.py
> +++ b/lib/lp/archivepublisher/publishing.py
> @@ -1084,33 +1083,54 @@ class Publisher(object):
> real_path = os.path.join(suite_dir, real_name)
> current_files[path] = (
> int(current_entry["size"]), current_entry["sha256"], real_path)
> +
> + # Gather information on entries currently in the database. Ensure
> + # that we know about all the relevant by-hash directory trees before
> + # doing any removals so that we can prune them properly later, and
> + # work out which condemned files should be reprieved due to the
> + # paths in question having their previous content again.
> + reprieved_files = defaultdict(list)
> uncondemned_files = set()
It took me far too long to work out that uncondemned_files has been completely repurposed: it previously contained the files that were to be uncondemned (and so were to have their condemnation rescinded), but now is those files that are *currently* uncondemned (and so may render the rescinding of another file's condemnation, now in the set called reprieved_files, unnecessary). This explains why I got rapidly confused by the changes when I looked at this a few times over the last couple of years!
> for db_file in archive_file_set.getByArchive(
> - self.archive, container=container, only_condemned=True,
> - eager_load=True):
> - stripped_path = strip_dists(db_file.path)
> - if stripped_path in current_files:
> - current_sha256 = current_files[stripped_path][1]
> - if db_file.library_file.content.sha256 == current_sha256:
> - uncondemned_files.add(db_file)
> - if uncondemned_files:
> - for container, path, sha256 in archive_file_set.unscheduleDeletion(
> - uncondemned_files):
> + self.archive, container=container, eager_load=True):
> + by_hashes.registerChild(os.path.dirname(strip_dists(db_file.path)))
> + file_key = (db_file.path, db_file.library_file.content.sha256)
> + if db_file.scheduled_deletion_date is None:
> + uncondemned_files.add(file_key)
> + else:
> + stripped_path = strip_dists(db_file.path)
> + if stripped_path in current_files:
> + current_sha256 = current_files[stripped_path][1]
> + if db_file.library_file.content.sha256 == current_sha256:
> + reprieved_files[file_key].append(db_file)
> +
> + # We may already have uncondemned entries with the same path and
> + # content as condemned entries that we were about to reprieve; if
> + # so, there's no need to reprieve them.
> + for file_key in uncondemned_files:
> + reprieved_files.pop(file_key, None)
> +
> + # Make sure nothing in the current Release file is condemned.
> + if reprieved_files:
> + reprieved_files_flat = set(
> + chain.from_iterable(reprieved_files.values()))
> + archive_file_set.unscheduleDeletion(reprieved_files_flat)
> + for db_file in reprieved_files_flat:
> self.log.debug(
> "by-hash: Unscheduled %s for %s in %s for deletion" % (
> - sha256, path, container))
> + db_file.library_file.content.sha256, db_file.path,
> + db_file.container))
Is separate reprieval interesting now that it just creates a new row anyway? It makes already complicated code significantly more involved, just to avoid either uploading a file again.
>
> # Remove any condemned files from the database whose stay of
> # execution has elapsed. We ensure that we know about all the
> # relevant by-hash directory trees before doing any removals so that
> # we can prune them properly later.
> - for db_file in archive_file_set.getByArchive(
> - self.archive, container=container):
> - by_hashes.registerChild(os.path.dirname(strip_dists(db_file.path)))
> for container, path, sha256 in archive_file_set.reap(
> self.archive, container=container):
> - self.log.debug(
> - "by-hash: Deleted %s for %s in %s" % (sha256, path, container))
> + if (path, sha256) not in uncondemned_files:
> + self.log.debug(
> + "by-hash: Deleted %s for %s in %s" %
> + (sha256, path, container))
This could use a comment to clarify that the "file" in "uncondemned_files" may correspond to multiple ArchiveFiles, so a file may be removed even if it is in uncondemned_files.
>
> # Ensure that all files recorded in the database are in by-hash.
> db_files = archive_file_set.getByArchive(
> diff --git a/lib/lp/soyuz/interfaces/archivefile.py b/lib/lp/soyuz/interfaces/archivefile.py
> index 005a599..e81419f 100644
> --- a/lib/lp/soyuz/interfaces/archivefile.py
> +++ b/lib/lp/soyuz/interfaces/archivefile.py
> @@ -49,6 +49,16 @@ class IArchiveFile(Interface):
> title=_("The index file in the librarian."),
> schema=ILibraryFileAlias, required=True, readonly=True)
>
> + date_created = Datetime(
> + title=_("The date when this file was created."),
> + # XXX cjwatson 2018-04-17: Should be required=True, but we need to
> + # backfill existing rows first.
> + required=False, readonly=False)
> +
> + date_superseded = Datetime(
> + title=_("The date when this file was scheduled for future deletion."),
> + required=False, readonly=False)
This seems like the least interesting property of date_superseded. The more interesting one is that it's when this file ceased to be the holder of its path.
> +
> scheduled_deletion_date = Datetime(
> title=_("The date when this file should stop being published."),
> required=False, readonly=False)
--
https://code.launchpad.net/~cjwatson/launchpad/+git/launchpad/+merge/373971
Your team Launchpad code reviewers is requested to review the proposed merge of ~cjwatson/launchpad:archive-file-history into launchpad:master.
References