launchpad-dev team mailing list archive

Re: Archive deletion strategy

 

Hi James

Thanks for posting this.  I've got a few corrections to some of your 
assumptions and some suggestions inline.

On Monday 02 August 2010 21:40:28 James Westby wrote:
> Hi,
> 
> I have been asked to work on reclaiming the space from COPY archives if
> desired, as they can be fairly substantial, and if we are going to
> create more, we want to make sure we can delete the ones that we don't
> care about.
> 
> There are various other issues that are tied to this though, so I wanted
> to write down what I have learned over the last couple of days and to
> ask for opinions on what we should do about deleting archives.
> 
> 
> What do we mean by deletion?
> ----------------------------
> 
> At this level archives are containers of source and binary publishing
> records, which in turn refer to source and binary packages (that may be
> shared across binary archives).
> 
> These source and binary packages are stored in the librarian, and they
> may also be stored on other filesystems ("published") as apt archives.
> 
> Deletion is therefore four things:
> 
>   1. Removal of the apt archive on disk
>   2. Removal of the librarian files
>   3. DELETED publication records for each package in the archive
>   4. Deletion of the database rows

Steps 3 and 4 are mutually exclusive.  Either we delete the DB rows or we mark 
them (the publishing records) as DELETED.
 
> and we can do any of these parts in various combinations, and vary them
> depending on the type of archive if we like.

If we wanted, we could keep two modes of deletion:
1. just delete the repo area but keep the history (database publishing 
records)
2. blow everything away

(1) is what we have right now.  I prefer (2) as it's less confusing to the 
user.
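
To make the two modes concrete, here is a minimal sketch.  The Archive class 
and delete_archive() are made-up stand-ins for illustration, not Launchpad's 
IArchive; the point is only that a mode either marks the publishing records 
DELETED or removes them, never both.

```python
from dataclasses import dataclass, field
from enum import Enum


class DeletionMode(Enum):
    KEEP_HISTORY = "keep_history"  # mode 1: remove the repo, keep the rows
    PURGE = "purge"                # mode 2: remove the repo and the rows


@dataclass
class Archive:
    # Hypothetical stand-in for illustration only, not Launchpad's IArchive.
    name: str
    on_disk: bool = True
    # Each publication is a (package_name, status) pair.
    publications: list = field(default_factory=list)


def delete_archive(archive, mode):
    """Apply one of the two deletion modes to the archive and return it."""
    # Both modes remove the published repo area.
    archive.on_disk = False
    if mode is DeletionMode.KEEP_HISTORY:
        # Keep the history: every record stays, marked DELETED.
        archive.publications = [
            (package, "DELETED") for package, _status in archive.publications
        ]
    else:
        # Blow everything away: no publishing records remain.
        archive.publications = []
    return archive
```

With mode 2 there is nothing left to mark as DELETED, which is why steps 3 
and 4 above cannot both apply to the same archive.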

> Current situation
> -----------------
> 
> Currently any archive can be deleted, by (correct me if I am wrong)
> 
>   1. PPA: owner or any uploader

This is just the owner.  Uploaders *can* be outside the owning team.

>   2. COPY: owner
>   3. PRIMARY, PARTNER, DEBUG: distribution driver
> 
> This is done by calling IArchive.delete().
> 
> This sets the status of the archive to DELETING.
> 
> In addition it causes all of the packages to be "deleted" in the
> publication sense, with DELETED publication records created for
> everything currently published. (It does this in the method, but
> Julian suggested that it should be moved outside as it will timeout
> for larger archives).

We don't need to do this, we can just remove that code.  See below.

> 
> This is currently only really done for PPAs I believe. We have never
> wanted to do it for PRIMARY, PARTNER or DEBUG, and COPY are just
> disabled as I understand it, though I am not sure why.

COPY archives get disabled to stop their builds from being dispatched.  
We've never considered the other archive types since they don't have (or are 
not supposed to have) a PPA-like URL.

> The effect that this has varies (I believe) based on the type of
> archive. For PPAs this means that the librarian files will be garbage
> collected in 7 days, but for PRIMARY I think the librarian will
> keep them. I'm not sure about the other types. Obviously the files
> are only removed if they are not still referenced by live
> records elsewhere (e.g. fairly common for .orig.tar.gz for PPAs)
> 
>   [ Can someone point me to the code that garbage collects PPA
>     files from the librarian ]

The librarian <hand wave> does the garbage collecting when it sees 
unreferenced or expired libraryfilealiases.  There's a script 
cronscripts/expire-archive-files.py which goes through PPAs and expires files 
for packages that are deleted or superseded.

However, if we simply blow away the publishing records and all their 
associated foreign key refs (sourcepackagereleasefile etc), then the librarian 
GC will DTRT as well.

We need to be careful here of course since those files may be published in 
more than one archive.
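
A rough sketch of why the shared-file case matters.  This is not the 
librarian's actual GC code; the function names and the (archive, file) pair 
representation are assumptions made up for the example.  It only shows that 
a file becomes collectable when the last publication pointing at it goes 
away.

```python
def unreferenced_files(all_files, publications):
    """Return the files that no publication points at.

    publications is an iterable of (archive_name, file_id) pairs; this
    flat shape is an assumption for the sketch.
    """
    referenced = {file_id for _archive, file_id in publications}
    return set(all_files) - referenced


def files_freed_by_deleting(archive_name, all_files, publications):
    """Return the files that only become unreferenced once the archive goes."""
    publications = list(publications)
    remaining = [(a, f) for a, f in publications if a != archive_name]
    before = unreferenced_files(all_files, publications)
    after = unreferenced_files(all_files, remaining)
    return after - before
```

If foo.orig.tar.gz is published in both ppa-a and ppa-b, deleting ppa-a 
frees only the files that ppa-a alone referenced; the shared tarball stays.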

> Then the publisher takes over. This is a cronscript which publishes
> the apt archives to disk from the librarian. It has a second
> behaviour though in that it looks for DELETING archives and acts on
> those too.
> 
> It would normally act to delete all the packages due to the creation of
> the DELETED publication records for every package, but it skips this
> as the archive is in the DELETING state.
> 
> It currently only acts on PPAs in the DELETING state, for which it
> deletes the tree on disk that makes up the apt archive. It then
> sets the status of the archive to DELETED.
> 
> It just skips other archive types, so they become essentially frozen
> from this point of view.

Right, we don't support deletion of those other types (yet).
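
For reference, the behaviour James describes boils down to something like 
the sketch below.  The enum and function names are invented for the example 
and this is not the real publisher code; remove_tree stands in for whatever 
actually deletes the apt tree on disk.

```python
from enum import Enum


class ArchivePurpose(Enum):
    PPA = "PPA"
    COPY = "COPY"
    PRIMARY = "PRIMARY"
    PARTNER = "PARTNER"
    DEBUG = "DEBUG"


class ArchiveStatus(Enum):
    ACTIVE = "ACTIVE"
    DELETING = "DELETING"
    DELETED = "DELETED"


def publisher_step(purpose, status, remove_tree):
    """Return the archive status after one publisher run.

    remove_tree is a hypothetical callable that removes the on-disk
    apt archive; normal publishing is not modelled here.
    """
    if status is not ArchiveStatus.DELETING:
        # Nothing deletion-related to do.
        return status
    if purpose is not ArchivePurpose.PPA:
        # Other archive types are skipped, so they stay in DELETING.
        return status
    remove_tree()
    return ArchiveStatus.DELETED
```

Supporting COPY archives would mean widening the purpose check, which is 
item 2 in the list of actions further down.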

> 
> Therefore we are in this state currently:
> 
>   - Only PPAs are actually deleted
>   - They are removed from disk by the publisher. Other archive types
>     wouldn't be.
>   - Their librarian files are garbage collected (I think). Other archive
>     types may not be.
>   - All archive types have DELETED publication records created.
>   - All archives remain in the DB.
> 
> 
> Where would we like to be?
> --------------------------
> 
> I think I have heard that we would like to nuke PPAs so that users can
> reuse the name.

That's correct.

> We would like to be able to purge copy archives to reclaim the librarian
> and apt archive space.

Yep.

> When discussing this on IRC the other day there was concern that
> purging archives would lead to unrecoverable mistakes. Currently as the
> artefacts are kept, for a few days at least, the archive could be
> almost reconstructed if it was accidentally deleted.

Deleting an archive takes you to a page that has a BIG FAT WARNING on it.  
If someone is stupid enough to continue pressing delete at that point when 
they don't really mean to, then I'm not sure what else we can do.

In any case, the librarian GC has a stay-of-execution period of 10 (?) days 
before it deletes the files, even if they're expired/unreferenced.
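
The stay-of-execution check amounts to something like this.  It is a sketch 
only: the function name is made up, and the 10-day default is my uncertain 
recollection from above, not a value read out of the librarian code.

```python
from datetime import datetime, timedelta

# Assumption: the real grace period may differ (the figure above is a guess).
GRACE_PERIOD = timedelta(days=10)


def eligible_for_removal(unreferenced_since, now, grace=GRACE_PERIOD):
    """Return True once a file has been unreferenced/expired for long enough.

    unreferenced_since is None while the file is still referenced.
    """
    if unreferenced_since is None:
        return False
    return now - unreferenced_since >= grace
```

Until that check passes, the files are still in the librarian, so an 
accidental deletion is recoverable for a while.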

> 
> Please reply with other constraints on the solution that you would like.
> 
> 
> What actions do we have?
> ------------------------
> 
> As I see it we currently have the following:
> 
>   1. Move creation of DELETED publishing records to the publisher.

We don't need to do that at all because we're going to delete those rows 
entirely.

>   2. Modify the publisher to remove COPY archives from disk as well as
>      PPAs.
>   3. Modify the publisher to also remove the librarian files of COPY
>      archives. (Alternatively we could extend the garbage collection
>      approach used for PPAs to COPY archives.)

You don't need to do this at all; the GC will DTRT for unreferenced files.  We 
just need to make sure we only delete the SPRF/BPF 
(sourcepackagereleasefile/binarypackagefile) rows if there are no more 
publications referencing them.
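
Roughly, the check looks like the sketch below, using sqlite3 so it can be 
run standalone.  The two tables are a heavily simplified, hypothetical 
stand-in for the real schema (the actual tables have more columns and an 
extra level of indirection), and purge_archive() is a name invented for the 
example.

```python
import sqlite3

# Hypothetical, simplified tables; not Launchpad's real schema.
FILES_SCHEMA = """
CREATE TABLE publication (archive TEXT, release_id INTEGER);
CREATE TABLE release_file (release_id INTEGER, filename TEXT);
"""


def make_db():
    """Create an in-memory database with the simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(FILES_SCHEMA)
    return conn


def purge_archive(conn, archive):
    """Delete an archive's publications, then any file rows left orphaned."""
    # Drop the archive's publishing records first.
    conn.execute("DELETE FROM publication WHERE archive = ?", (archive,))
    # Then drop only the file rows whose release has no publication left
    # in any archive.  Note this sweeps up every orphaned row, not just
    # the ones that belonged to this archive.
    conn.execute(
        "DELETE FROM release_file WHERE NOT EXISTS ("
        " SELECT 1 FROM publication"
        " WHERE publication.release_id = release_file.release_id)"
    )
    conn.commit()
```

Once those rows are gone, nothing references the underlying librarian files 
and the GC can pick them up on its own schedule.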

>   4. Modify the publisher to delete db rows of publication records,
>      unique sources and binaries, etc. in COPY archives. (Less important
>      than deleting the librarian files.)
> 
> There will be alternative or further actions depending on what we want
> to achieve.

I think that's pretty much it, although we need to examine exactly which 
database rows need to be deleted and under what conditions.  I've added some 
ON DELETE CASCADEs in the past where it's a no-brainer, but they won't delete 
everything, because multiple publications can reference the same package 
files.
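
Here is a small standalone illustration of that limitation, again with 
sqlite3 and invented table names rather than the real schema.  The cascade 
runs from the archive down to its publications, but the package file rows 
sit on the other side of the foreign key, so they are left alone.

```python
import sqlite3

# Hypothetical tables for illustration; not Launchpad's real schema.
CASCADE_SCHEMA = """
CREATE TABLE archive (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE package_file (id INTEGER PRIMARY KEY, filename TEXT);
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    archive_id INTEGER NOT NULL REFERENCES archive(id) ON DELETE CASCADE,
    package_file_id INTEGER NOT NULL REFERENCES package_file(id)
);
"""


def open_db():
    """Create an in-memory database with foreign key enforcement on."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this is set.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(CASCADE_SCHEMA)
    return conn


def count(conn, table):
    """Return the number of rows in one of the tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Deleting an archive row removes its publications automatically, but a 
package file that only that archive used is left behind and still needs the 
explicit "no more publications" check before it can be removed.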

> 
> 
> This email is no doubt full of rumours, half-truths, and libel. Your
> assistance in correcting the public record is appreciated.

Hopefully I've set you on track!

Cheers
J


