Yellow Squad retrospective minutes May 25

 

= Attendance =

The gang's all here: bac, benji, frankban, gmb, gary_poster

= Project plan =

- We have two 24-core machines coming, one each for devel and dbdevel: yay
- We have not tested regularly with that many cores: boo. We need to
start testing regularly with 32-core EC2 machines when we have them.
This will increase our bug count.
- Encountered some road bumps this week that really slowed us down and
introduced some serious problems (bug 1004088, bug 1003206). We won't
have the statistics we hoped for by our checkpoint next week.

= New tricks =

Nobody mentioned any, but people might be interested in benji's new
terminal tricks: https://dev.launchpad.net/yellow/Termbeamer . It's
nicely packaged and ready for beta testers. Ever thought about sharing a
terminal session over Jabber/GTalk? :-)

= Any nice successes? =

- gary_poster: Apache analysis for bug 1004088 by gmb and bac was our
big success this week. bac & gary_poster: another win for collaboration.

- frankban: lxc-start-ephemeral fix, including lxc-ip, is almost ready.
[Note that, since this meeting, I've discovered that lxc-ip will no
longer be used, in favor of some bash code for now and an improved
lxc-info later. That's a disappointment, but should be a good end state
for lxc.]

= Problems =

- We had a hard time diagnosing bug 1004088. (Take a glance at it if you
are curious: https://bugs.launchpad.net/launchpad/+bug/1004088) benji:
something to remember/note is that, if you are dealing with concurrent
processes, linear divide and conquer approaches fail. frankban: since we
are using lxc-ip, when we log in the only thing that has definitely
started up is the network stack, not Apache. Expecting the machine to be
fully initialized when we started our work was one proximate cause of
the problem, and of why it was difficult to diagnose. benji: we
frequently have this problem of being surprised when environments are
not as ready as we expect (juju etc.). When we write code that waits for
something, we should remember to ask "what is our definition of ready?
does it match the definition of what we are waiting for?"
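
As a concrete illustration of benji's "definition of ready" point, here
is a minimal sketch (not code from our tree; the URL, timeout, and
function name are assumptions for illustration only): instead of
assuming a freshly started container is usable as soon as we can reach
it, poll for the specific service we depend on.

  import time
  import urllib2


  def wait_for_service(url, timeout=300, interval=5):
      """Poll url until it responds, or give up after timeout seconds."""
      deadline = time.time() + timeout
      while time.time() < deadline:
          try:
              urllib2.urlopen(url, timeout=interval)
              return  # "Ready" means the service answers, not just ping.
          except IOError:
              time.sleep(interval)
      raise RuntimeError('%s not ready after %s seconds' % (url, timeout))


  # Here "ready" is defined as Apache answering on localhost; the network
  # stack being up is not enough.
  wait_for_service('http://localhost/')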

- gmb: zope.testing changes for bug 996729 broke devel because we broke
the subunit stream so ec2 test couldn't work properly (bug 1003696).
This is also causing us to have to do rework. When we make fundamental
changes to infrastructure...what do we need to do to keep this from
happening? try it first? that seems like a platitude, and we did try it;
we just were not careful enough. It is good that we currently have a
test runner that does not use subunit, because it caught the problem: it
could have been much worse, with buildbot/pqm accepting broken branches
into stable. When we switch to parallel testing, we will be relying even
more heavily on subunit. What can we do to provide a catch for this kind
of problem in the future? benji: Maybe we could have a minimum number of
tests that must pass in buildbot? gary_poster: Maybe have a maximum
negative diff between landings? We could do something like this in
buildbot and maybe ec2 also. ACTION ITEM! File a bug that parallel
testing *and* ec2 should have a minimum number of tests to expect, or a
maximum negative diff (i.e., a given run should not run fewer tests than
[number of tests in the last run] - 100). If we want ec2
to have a maximum negative diff, ec2 needs some way of getting the last
test run's test count, such as from buildbot's webservice.
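
A minimal sketch of the "maximum negative diff" idea (the threshold and
the way the previous count is obtained are assumptions; for ec2 it might
come from buildbot's web service, as noted above):

  def check_test_count(current, previous, max_negative_diff=100):
      """Abort if far fewer tests ran than in the previous run."""
      if previous is None:
          return  # No baseline recorded yet; nothing to compare against.
      if current < previous - max_negative_diff:
          raise SystemExit(
              'Only %d tests ran; the last run had %d. The subunit '
              'stream or test collection is probably broken.'
              % (current, previous))


  # Example: a run that reports 15000 tests when the last run had 20000
  # would be rejected rather than silently treated as a pass.
  check_test_count(15000, 20000)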

- benji: This is the second time we broke stuff by changing
zope.testing. For the previous failure, we incorrectly cleaned up
someone else's mess. We tried to prevent that mess in the future by
announcing the problem on the mailing list, but that's probably not a
real solution. benji: instead, we could have a comment at the top of the
versions file describing the process to follow if you are using a
custom-built version of code, and pointing to a wiki page. We like this.
ACTION ITEM!

- bac: After discussion with Robert and Francis, we decided not to
support our buildbot juju charm, so it will leave the charm repo and die
because the ~charmers reasonably don't think there is enough need for it
(given the preference for jenkins). Does this mean we made a mistake in
creating the Juju charm? We agree that it was a net positive, given our
increased Juju experience, the feedback we were able to give to the Juju
team, the Python charm helpers that Clint intends to package, and the
python-shelltoolbox that Clint intends to package and sponsor in Ubuntu.
Moreover, we brought value along the way, and, in the lean philosophy,
it is fine to discard steps later that were productive at the time. But
we didn't question this decision, because using buildbot was a project
directive. Perhaps in future we should question directives? Questioning
all the requirements given to us is annoying and counterproductive to
our clients. However, sometimes it is a good idea. Perhaps we should
question requirements among ourselves first. Before we consider bringing
it up to the customer, and before we spend much time on analysis, we
should make a rough plan and estimate for answering those questions. If
getting the answer is relatively cheap, and/or if the answer is
potentially important, go ahead and raise the question with the
customer, including the rough plan we've assembled to answer it. ACTION
ITEM?? Should we make a checklist for starting a project?

- gary_poster: ACTION ITEM: I should mention to Francis that LP should
maybe maintain the charms for as long as we use buildbot. It sure is
nice to be able to quickly fire off a buildbot environment to test
changes in.

- gmb: "Having root makes you stupid." We have all ignored the issues
with make clean and /var/tmp/bazaar.launchpad.dev in the past because,
on our machines, we could (bug 1004088). Whenever you are about to use a
big hammer on a problem, stop and think whether you can use a little
hammer. gary_poster: If you encounter an annoyance and investigating it
now doesn't make sense, add a slack task idea card. That might help you
remember to investigate later when you have a moment, and still let you
get your active card done now. ACTION ITEM?? Should we have a checklist
of what to do when we encounter an annoyance? Can we do something else
to turn this into a process?

- gary_poster: Our juju charm tests have bitrotted. Why didn't we know
sooner? We had automated tests that were supposed to be run by the charm
repository, but the regular runs are not ready yet.

- bac: Similarly, why did we not see the problem sooner for bug 1004088?
The buildbot change that triggered the bug was Friday. gary_poster:
Because there is no real automatic testing yet (Gary is the automatic
testing system), and because we had multiple big issues at once (also
1003696 and 1003206 as fallout from 996729). We tried to set up the
automatic testing earlier but broke out of the timebox. :-/

- bac: making experimental changes is really hard to push through our
environment: the lpsetup PPA is hard, and it is tied to LP code changes.
benji: complexity of things interconnecting is a common source of these
annoyances. gary_poster: Rich Hickey (Clojure) calls that complecting.
frankban: lpsetup could have a configuration file. benji: call chain
visualization would be nice, but we are talking about projects
interconnecting, not code internally interconnecting. How do we identify
these sorts of problems early? benji: have a requisite boxes-and-lines
diagram? Maybe too much. gary_poster: Another direction: if we wait on
the computer for more than a minute to do something (for example), that
is a problem to raise on the weekly call. As an example, we could have
done a lot better if we had sped up our juju startup time at the
beginning. The whole parallel project is an acknowledgement of the
importance of faster turnarounds. Long wait times are arguably an
indication of problems, in addition to being a problem themselves.
ACTION ITEM: add this question to the weekly retrospective call's
problem identification checklist.
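
frankban's configuration-file idea, sketched generically (this is not
lpsetup's actual interface; the file name, section, and options are
hypothetical): read defaults from an INI file so experiments don't
require editing the tool or rebuilding a PPA, and let command-line
flags override them.

  import argparse
  import os
  import ConfigParser  # configparser on Python 3


  def load_options(argv=None):
      """Combine INI-file defaults with command-line overrides."""
      config = ConfigParser.SafeConfigParser()
      # Hypothetical location; missing files are silently ignored.
      config.read([os.path.expanduser('~/.setup-tool.conf')])
      defaults = {}
      if config.has_section('defaults'):
          defaults = dict(config.items('defaults'))

      parser = argparse.ArgumentParser(
          description='hypothetical setup tool with config-file defaults')
      parser.add_argument(
          '--branch', default=defaults.get('branch', 'lp:launchpad'))
      parser.add_argument(
          '--ppa', default=defaults.get('ppa', ''))
      return parser.parse_args(argv)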

- gary_poster: We are not delivering value incrementally. Can we be?
benji: We are fixing some bugs, so that is incremental value. bac: Maybe
we should actually try to fix the big critical thing we encountered with
Apache (the "real" bug for bug 1004088).