
launchpad-dev team mailing list archive

Archive deletion strategy

 

Hi,

I have been asked to work on reclaiming the space used by COPY
archives, as they can be fairly substantial. If we are going to create
more of them, we want to make sure we can delete the ones that we don't
care about.

There are various other issues that are tied to this though, so I wanted
to write down what I have learned over the last couple of days and to
ask for opinions on what we should do about deleting archives.


What do we mean by deletion?
----------------------------

At this level archives are containers of source and binary publishing
records, which in turn refer to source and binary packages (that may be
shared across binary archives).

These source and binary packages are stored in the librarian, and they
may also be stored on other filesystems ("published") as apt archives.

Deletion is therefore four things:

  1. Removal of the apt archive on disk
  2. Removal of the librarian files
  3. Creation of DELETED publication records for each package in the
     archive
  4. Deletion of the database rows

and we can do any of these parts in various combinations, and vary them
depending on the type of archive if we like.
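
To make "various combinations" concrete, here is a minimal sketch that
treats the four parts as independent flags and looks up a combination
per archive type. The names DeletionStep, POLICY and steps_for are
invented for illustration, and the policy table is a made-up example
for discussion, not what Launchpad does.

```python
from enum import Enum, Flag, auto


class DeletionStep(Flag):
    """The four parts of deletion listed above (names are illustrative)."""

    NONE = 0
    REMOVE_APT_ARCHIVE = auto()      # 1. apt archive on disk
    REMOVE_LIBRARIAN_FILES = auto()  # 2. files in the librarian
    MARK_PUBLICATIONS = auto()       # 3. DELETED publication records
    DELETE_DB_ROWS = auto()          # 4. database rows


class ArchiveType(Enum):
    PPA = "PPA"
    COPY = "COPY"
    PRIMARY = "PRIMARY"
    PARTNER = "PARTNER"
    DEBUG = "DEBUG"


# Hypothetical policy for discussion only; not what Launchpad does today.
POLICY = {
    ArchiveType.PPA: (
        DeletionStep.REMOVE_APT_ARCHIVE
        | DeletionStep.REMOVE_LIBRARIAN_FILES
        | DeletionStep.MARK_PUBLICATIONS
    ),
    ArchiveType.COPY: (
        DeletionStep.REMOVE_APT_ARCHIVE
        | DeletionStep.REMOVE_LIBRARIAN_FILES
        | DeletionStep.MARK_PUBLICATIONS
        | DeletionStep.DELETE_DB_ROWS
    ),
}


def steps_for(archive_type):
    """Return the steps to run; types without an entry get none."""
    return POLICY.get(archive_type, DeletionStep.NONE)
```

With a table like this, changing what deletion means for one archive
type is a one-line change that does not affect the others.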


Current situation
-----------------

Currently any archive can be deleted by (correct me if I am wrong):

  1. PPA: owner or any uploader
  2. COPY: owner
  3. PRIMARY, PARTNER, DEBUG: distribution driver

This is done by calling IArchive.delete().

This sets the status of the archive to DELETING.

In addition it causes all of the packages to be "deleted" in the
publication sense, with DELETED publication records created for
everything currently published. (It does this in the method, but
Julian suggested that it should be moved outside as it will time out
for larger archives).
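
As a rough model of that behaviour (not the real IArchive.delete()
code), the sketch below uses stand-in Archive and Publication classes.
The "ACTIVE" and "PUBLISHED" status names are assumptions, and so is
the choice to mark the existing record as DELETED in place; whether
Launchpad updates the row or creates a new one is not something this
sketch claims.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    package: str
    status: str = "PUBLISHED"  # assumed name for a live publication


@dataclass
class Archive:
    name: str
    status: str = "ACTIVE"  # assumed name for a normal archive
    publications: list = field(default_factory=list)

    def delete(self):
        """Model of delete() as described above; illustrative only."""
        self.status = "DELETING"
        # This loop is the part that could time out for large archives,
        # since it touches every currently published package.
        for pub in self.publications:
            if pub.status == "PUBLISHED":
                pub.status = "DELETED"
```

The loop is linear in the number of published packages, which is why
doing it inside a web request is a problem for large archives.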

This is currently only really done for PPAs I believe. We have never
wanted to do it for PRIMARY, PARTNER or DEBUG, and COPY archives are
just disabled as I understand it, though I am not sure why.

The effect that this has varies (I believe) based on the type of
archive. For PPAs this means that the librarian files will be garbage
collected in 7 days, but for PRIMARY I think the librarian will
keep them. I'm not sure about the other types. Obviously the files
are only removed if they are not still referenced by live
records elsewhere (which is fairly common for .orig.tar.gz files in
PPAs).

  [ Can someone point me to the code that garbage collects PPA
    files from the librarian? ]
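
Without having seen that code, the rule as described above (wait 7
days, then remove only files that nothing live still references) can
be sketched like this. The function name, its arguments and the use of
plain file ids are all invented for illustration; this is not the
librarian's actual garbage collector.

```python
from datetime import datetime, timedelta

# Grace period taken from the "7 days" figure above.
GRACE_PERIOD = timedelta(days=7)


def collectable_files(archive_files, live_references, deleted_at, now):
    """Return the file ids that could be garbage collected.

    archive_files: file ids that belonged to the deleted archive.
    live_references: file ids still referenced by live records elsewhere.
    deleted_at, now: datetimes for the deletion and the current run.
    """
    if now - deleted_at < GRACE_PERIOD:
        return set()  # still inside the grace period, keep everything
    return {f for f in archive_files if f not in live_references}
```

A shared .orig.tar.gz would appear in live_references and so would
survive, even after the grace period has passed.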


Then the publisher takes over. This is a cronscript which publishes
the apt archives to disk from the librarian. It has a second
behaviour though in that it looks for DELETING archives and acts on
those too.

It would normally act to delete all the packages due to the creation of
the DELETED publication records for every package, but it skips this
as the archive is in the DELETING state.

It currently only acts on PPAs in the DELETING state, for which it
deletes the tree on disk that makes up the apt archive. It then
sets the status of the archive to DELETED.

It just skips other archive types, so they become essentially frozen
from this point of view.
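
The publisher's handling of DELETING archives, as I understand it,
looks roughly like the sketch below. The function name and the
string-based types and statuses are stand-ins, not the publisher's
real interface, and the demonstration only touches throwaway temporary
directories.

```python
import shutil
import tempfile
from pathlib import Path


def handle_deleting_archive(archive_type, status, tree):
    """Return the archive's new status after one publisher run."""
    if status != "DELETING":
        return status  # normal publishing, not covered here
    if archive_type != "PPA":
        return status  # skipped: stays DELETING, nothing is removed
    shutil.rmtree(tree, ignore_errors=True)  # remove the apt archive tree
    return "DELETED"


# Demonstration on throwaway directories.
ppa_tree = Path(tempfile.mkdtemp())
copy_tree = Path(tempfile.mkdtemp())
ppa_status = handle_deleting_archive("PPA", "DELETING", ppa_tree)
copy_status = handle_deleting_archive("COPY", "DELETING", copy_tree)
```

Here the PPA ends up DELETED with its tree gone, while the COPY archive
stays in DELETING with its tree untouched, which is the "frozen" state
mentioned above.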

Therefore we are in this state currently:

  - Only PPAs are actually deleted
  - They are removed from disk by the publisher. Other archive types
    wouldn't be.
  - Their librarian files are garbage collected (I think). Other archive
    types may not be.
  - All archive types have DELETED publication records created.
  - All archives remain in the DB.


Where would we like to be?
--------------------------

I think I have heard that we would like to nuke PPAs so that users can
reuse the name.

We would like to be able to purge copy archives to reclaim the librarian
and apt archive space.

When discussing this on IRC the other day there was concern that
purging archives would lead to unrecoverable mistakes. Currently as the
artefacts are kept, for a few days at least, the archive could be
almost reconstructed if it was accidentally deleted.

Please reply with other constraints on the solution that you would like.


What actions do we have?
------------------------

As I see it we currently have the following:

  1. Move creation of DELETED publishing records to the publisher.
  2. Modify the publisher to remove COPY archives from disk as well as
     PPAs.
  3. Modify the publisher to also remove the librarian files of COPY
     archives. (Alternatively we could extend the garbage collection
     approach used for PPAs to COPY archives.)
  4. Modify the publisher to delete db rows of publication records,
     unique sources and binaries, etc. in COPY archives. (Less important
     than deleting the librarian files.)

There will be alternative or further actions depending on what we want
to achieve.


This email is no doubt full of rumours, half-truths, and libel. Your
assistance in correcting the public record is appreciated.

Thanks,

James


