launchpad-dev team mailing list archive

======================
The State of the Soyuz
======================

Progress on Soyuz can be largely categorised into three items:

1. Feature work
2. Ongoing important bugs (most are tagged 'boobytrap')
3. Firefighting

I shall report on each of these below.


Feature Work
============

We've worked on two main features in the last few months.


Buildd-manager scalability
--------------------------

This feature is largely done, barring any last-moment problems.  The
buildd-manager has been extensively and invasively re-written to be cleaner,
clearer and, most importantly, fully asynchronous, which finally allows events
from all the builders to overlap.  We also moved the build upload processing
to an external queue so that it is no longer done in a blocking fashion inside
the manager itself.
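
To illustrate the shape of that change, here is a minimal sketch using
Python's ``asyncio``.  It is not the buildd-manager's code: the function
names, the fake delays and the in-process queue are all assumptions made for
the example (the real upload processing runs outside the manager).  It only
shows the two ideas described above: builders are polled concurrently, and
finished builds are handed to a queue instead of being processed inline.

```python
import asyncio


async def poll_builder(name, delay, upload_queue):
    """Pretend to wait on one builder, then hand its result off for upload."""
    # Stand-in for a network round trip to a builder; while this one
    # waits, the event loop is free to service the other builders.
    await asyncio.sleep(delay)
    # Queue the finished build rather than processing the upload here.
    await upload_queue.put(name)


async def process_uploads(upload_queue, processed):
    """Hypothetical stand-in for the upload processor draining the queue."""
    while True:
        name = await upload_queue.get()
        processed.append(name)
        upload_queue.task_done()


async def scan_builders(builders):
    upload_queue = asyncio.Queue()
    processed = []
    consumer = asyncio.create_task(process_uploads(upload_queue, processed))
    # Poll every builder concurrently instead of one after another.
    await asyncio.gather(
        *(poll_builder(name, delay, upload_queue) for name, delay in builders)
    )
    await upload_queue.join()  # wait until every queued upload is handled
    consumer.cancel()
    return processed


def run_scan(builders):
    """Run one scan over (builder name, fake delay in seconds) pairs."""
    return asyncio.run(scan_builders(builders))
```

With this structure a slow builder delays only its own result; the total scan
time is bounded by the slowest builder rather than by the sum of all of them.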

The result is a lean, mean build farm that rarely sees the kind of massive
build queues seen in the past.  There's a peak in queue length around
23:30 UTC each day when the daily recipe builds kick off, but these are dealt
with very swiftly now.


Derived distributions
---------------------

Work on derived distros is still in full swing.  Approximately two thirds of
the UI is done (mostly the page that shows the differences between child and
parent series), but we keep identifying further changes that are needed.  A
design decision in the LEP to simultaneously open and initialise a new distro
series needs to be revisited because the Ubuntu team now wants to do these
steps separately.  We also need to add UI parts to show progress indications
for things like sync operations and diff requests.

The backend for asynchronously initialising a distroseries from a parent is 
finished (thanks to Steve's hard work) and can be initiated from the API.  
Initiating from the web UI won't be possible until the above redesign is done 
and implemented.

The backend for doing sync operations is nearly finished, and Jelmer assures 
me it will be done before he absconds to the Bazaar team in January!

In progress is the very complicated code that we need to determine the
differences between two distroseries.  This necessitated some changes to Gina
so that the changelog is available in the database, where it can be probed
for releases that were never separately imported.
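
As a rough idea of what "differences between two distroseries" means at its
very simplest, here is a toy sketch.  It is not the Launchpad code and ignores
almost everything that makes the real problem hard: the function names and
the data shapes (plain dicts and lists of version strings) are assumptions,
versions are compared for equality only rather than with Debian version
ordering, and the "last common version" lookup is only a guess at the kind of
question a changelog can help answer.

```python
def classify_differences(parent, child):
    """Compare two {source package name: version} mappings."""
    differences = {}
    for name in sorted(set(parent) | set(child)):
        if name not in child:
            differences[name] = "missing in child"
        elif name not in parent:
            differences[name] = "unique to child"
        elif parent[name] != child[name]:
            differences[name] = "different versions"
        # Packages with identical versions are not reported at all.
    return differences


def last_common_version(parent_changelog, child_changelog):
    """Return the newest version listed in both changelogs, or None.

    Each argument is a list of version strings, newest first, as they
    might be read from a package's changelog.
    """
    child_versions = set(child_changelog)
    for version in parent_changelog:
        if version in child_versions:
            return version
    return None
```

A version that appears in both changelogs can be found this way even if that
release was never imported as a separate record, which is why having the
changelog text to hand is useful.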


Booby Trap Bugs
===============

Any bugs that will cause us to drop everything and brandish fire extinguishers 
if they go off are tagged with 'boobytrap'.  We've been making fairly slow but 
steady progress fixing these (being a man down on the team has not helped).

The main bugs that were fixed are to do with the publisher, which used to hate
uninitialised distroseries (fixing this has enabled Ubuntu to do early opening
of future series), the buildd-manager (fixed as part of its re-write), and
package copying.  Package copying bugs are a particular annoyance since we've
had a few that made the publisher fail completely and blocked all PPAs from
getting published.

There are a few more of these in progress now, such as preventing files from
getting re-uploaded once they've been deleted (which has horrible knock-on
effects when people then copy those packages to other PPAs) and some
buildd-manager improvements to better tolerate transient builder/network
failures.

Finally, we've got some publisher performance issues caused by a few
different bugs that leave behind superseded/deleted sources that can never be
condemned for removal.  We've got a good handle on those and they will be
fixed soon.


Firefighting
============

Soyuz has had an unfortunate number of production incidents over the last few 
months.  These were all either buildfarm issues or PPA publisher issues, both 
of which are very high profile and high impact.

* 2010-06-17 - PPA publisher complete failure.  This was caused by it trying
  to write an OOPS file to a location it didn't have permission to write to.

* 2010-08-12 - After the first stage of the buildd-manager re-write, it was
  not catching EINTR properly, which caused the running job to fail
  instantly.

* 2010-10-07 - Death row processing (removing condemned files) was failing
  and causing many PPAs to go over quota with no way of fixing that.  The
  cause was the Postgres 8.4 upgrade, which made a particular query an order
  of magnitude slower.

* 2010-10-28 - The build farm failed to dispatch any builds, caused in part
  by the buildd-manager re-write hitting problems that don't occur in the
  test environment.

* 2010-11-17 - Apache returning "500" errors when accessing private PPAs.
  This was caused by the .htaccess files being written with incorrect
  permissions.
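
For readers unfamiliar with the EINTR class of bug in the 2010-08-12
incident: a blocking system call can return early with ``EINTR`` when a signal
arrives, and code that treats that as a real failure will abort work that
could simply have been retried.  The sketch below shows the usual retry
pattern.  The helper name is made up for this example and it is not the fix
that went into the buildd-manager.  Note also that since Python 3.5 (PEP 475)
most standard library calls retry on ``EINTR`` automatically, so this mainly
matters for older Python versions such as those in use at the time.

```python
import errno


def retry_on_eintr(call, *args, **kwargs):
    """Call `call` again for as long as it is interrupted by a signal."""
    while True:
        try:
            return call(*args, **kwargs)
        except OSError as error:
            if error.errno != errno.EINTR:
                raise  # a real failure: let the caller see it
            # Interrupted by a signal before completing: just retry.
```

Any other error, such as a permission problem, is re-raised untouched so that
genuine failures are still reported.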


The Future
==========

Who knows what the future holds, other than goodbye Soyuz team, hello Squad 
Red!