Re: Landing team meeting 19.03.14

On Thu, Mar 20, 2014 at 03:43:02PM -0300, Gustavo Boiko wrote:
> The only problem is that this doesn't scale. While one big feature lands
> (say, Qt 5.2), there are at least five or more others being developed and
> maybe even proposed. So we pick one of those to land, and then while we are
> in the "small cycle" for that one, there are already four others
> waiting, and more being developed. It takes way too much time to get
> everything in, which means features get released much later than they
> could, which in turn means they will have less testing time in the end.

Right.  Particularly if you have anything that involves a non-trivial
dependency stack (I have some things where I can't really work out what
I'll need to do next until I've landed the stage before), then the
current process results in things taking weeks longer than they should.

I'm quite confident that I would have "click chroot" working smoothly by
now for 14.04 frameworks if not for this, for example - right now it
doesn't install all of the necessary qtdeclarative plugins, and in my
judgement the only sane way to fix this involves going through
correcting multiarch metadata in lots of library packages throughout the
stack so that I can simply have it install "ubuntu-sdk-libs-dev:armhf"
rather than hardcoding a huge pile of package names in click.  Despite
the best efforts and good will of the landing team, this sort of thing,
which carries extremely low runtime risk, is very poorly served by the
process that's in place at the moment; I'm afraid it feels unnecessarily
obstructive.  Every time we multiply what could have been a couple of
hours of delay into a couple of days, it destroys productivity just as
surely as an edit-compile-test cycle that takes hours rather than
minutes.
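
(To sketch what I mean, with an invented package name purely for
illustration: the fix is mostly a matter of marking each library package
in the stack as co-installable across architectures in its
debian/control, along the lines of

    Package: libexample-qml-plugin
    Architecture: any
    Multi-Arch: same
    Depends: ${misc:Depends}, ${shlibs:Depends}

and once enough of the stack is annotated that way, setting up the
chroot can boil down to a single

    apt-get install ubuntu-sdk-libs-dev:armhf

rather than a hand-maintained list of plugin packages.)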

I get that having a working image that users can upgrade to is
important.  I really do, especially as we move towards shipping devices.
But if you set the level of acceptable risk to zero, then you also
cripple velocity; much though we need to keep things working so that we
can dogfood, we also still have a lot of catching up to do before we
surpass (say) Android's usability.  I don't think we can afford this in
the long term.  One of our key development assets is significant
parallelisation across a wide range of projects, drawing on the whole
free software community.  Serialising all this into a narrow bottleneck
of landings throws away that asset in the cause of risk aversion.  It is
not clear to me that it is worth the trade.


Any time somebody brings this up, a frequent response is "well, you can
go and help out with the known regressions".  This is the fallacy of the
interchangeable developer, and it is terrible that we keep perpetuating
it.  It isn't sensible for several dozen people who are blocked on
landings to all try to teach themselves enough about (say) the innards
of Qt from scratch in order to work out what's going wrong.  Some of
them will waste their time flailing, some of them may waste the time of
the people who are actually qualified to fix the regressions by asking
overly basic questions or being generally confused, and hopefully some
of them will pretend they never saw this and get on with something they
can actually do.  Maybe one of them might contribute something helpful,
but probably only if not quite the right people were on the problem to
begin with.  (I include myself in all this; I'm not deprecating my
colleagues' abilities, just recognising that superhuman
masters-of-all-trades don't really exist.)  Knowledge sharing is good,
yes, but a fire drill isn't a good time to do it, and it probably
shouldn't be everyone-to-everyone anyway.

Surely, what should happen is:

 * Management should identify the people who are qualified to fix the
   regressions in question, make sure they're working on it and have
   what they need (hopefully they'll do it themselves organically, but
   this isn't guaranteed), and make sure this is communicated.  The
   point of this is mainly to make sure that serious problems don't fall
   through the cracks because nobody thinks it's their problem to solve.

 * Engineers should be particularly responsive to requests for help
   during times when there are known regressions, and should be alert
   for problems that touch on their areas of expertise.

 * Work that overlaps with the regressing areas should be treated with
   care, so that we don't pile problem upon problem.

 * Unrelated work should be able to proceed as normal, with caution but
   without undue fear.

 * People who are not qualified to work on the regressions should not be
   told that that's what they need to do if they want to get their code
   landed.

I understand that people are scared that if we don't serialise landings
when things are going wrong then we'll have a series of successive
out-of-phase bugs and we'll end up never being able to promote an image
again.  I think we've drawn too broad a lesson from past problems.  Yes,
we need to be careful not to aggregate risk from lots of nearby
overlapping changes.  But I don't believe that after ten years we don't
have the institutional expertise to spot when two changes truly have
nothing to do with each other - and our stack is big enough that this is
still frequently the case.  Even when our archive was a complete
free-for-all with no automatic installability gatewaying (and so
upgrades were often broken), we still managed to produce pretty decent
desktop milestone images with only a couple of days of heavy lifting,
and most of that was typically cleaning up after problems that
proposed-migration and other parts of the daily quality initiative now
keep out of the archive for us.

I am not at all convinced that the phone stack is so much more complex
that we can't loosen our grip a bit and still have things workable,
especially now that we have some much better technology for catching
many frequent categories of regressions (certainly a worthwhile benefit
of all this hard experience over the last couple of years); and as a
bonus we might not burn out some of our best engineers trying to do
ultra-tight coordination of everything all the time.

Given the choice, which is better: to have slightly more frequent
breakage, but have key engineers be fresh and able to work on urgent
problems that come their way every so often; or to have our key
engineers concentrate hard every day to make sure as few regressions as
possible slip in, at the cost that when difficult problems show up
they're too tired and demotivated to deal with them properly?  I'm
worried that risk aversion means we tend to aim for the latter.

Cheers,

-- 
Colin Watson                                       [cjwatson@xxxxxxxxxx]

