launchpad-dev team mailing list archive

Yellow squad weekly retrospective meeting minutes: June 29

To: Launchpad Development List <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
From: Gary Poster <gary.poster@xxxxxxxxxxxxx>
Date: Mon, 02 Jul 2012 11:22:25 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1
On blog:
http://codesinger.blogspot.com/2012/07/yellow-squad-weekly-retrospective.html

Headlines

Project report (now divided into "status" and "goals" sections)

Tricks
 * gary_poster: how can code determine if it is being run within an LXC
container?
 * benji: use -s flag to combine nosetests and pdb
 * bac: nosetests will not discover test modules if the execute bit is set
 * gmb: beware: lpsetup probably shouldn't overwrite your SSH keys with
nonsense
 * bac: beware: multiple running gpg-agents are bad. Years-old login
files (.profile, .bashrc) are bad.

Successes
 * bac: a fast-to-deploy, repeatable tarmac story

Pain
 * benji: decoy doctests
 * benji: chafing on frameworks
 * gary_poster: updating lpsetup's system workarounds over time
 * gary_poster/frankban: integration tests for lpsetup, take 2


And now, the minutes.

--------------------------------


Attendance

Attending: bac benji frankban gary_poster.
gmb is unavailable, though he shared notes for one item.
(These are freenode.net nicks)


Project report

Status

gmb was unavailable this week, though he still worked on a related
project, as noted below.
 * We've twice gotten as high as a 97% success rate over a three-day
rolling period of tests, and once as low as 83%.  Our goal is 95% or higher.
 * We had one new failure, which occurred once: the networking did not
start within our one minute timeout in the (non-ephemeral) lxc we use
for initially building the code before the tests are run.  It worked
fine before and after the failure on the same machine, and we don't know
what to do to investigate this further.  It smells a bit related to bug
1014916, mentioned last week as getting a fix, but the symptoms are
somewhat different and the described symptoms of bug 1014916 have not
occurred since we instituted the workaround.
 * As in past weeks, we saw a few instances of timeout-related failures
from bugs 974617 and 1011847, and one timeout-related failure from bug
1002820.  For the first pair, we asked for assistance and opinions from
stub and lifeless.  Stub mentioned increasing the timeout again, but
since we are already at what we perceive to be a large three-minute
timeout, we have not pursued that yet.
 * The two previous bullet points describe the entirety of the failures
we encountered.
 * Working on a sprint with the blue squad (jam, jelmer, vila, and mgz)
this week in Amsterdam, gmb has led an effort to get the parallel
testing story working with Launchpad running on Python 2.7 in Precise
LXC containers (it is currently running on Python 2.6 in Lucid LXC
containers, matching the current production configuration in the data
center).  On Friday, gmb and the blue squad made a breakthrough on this
front that fixed a bug in zope.testing and got the tests fully running
in the 2.7/Precise environment.
 * We are still waiting to hear from IS that the two new 24-core
machines, which will run the virtualized tests in production, have arrived.
 * We proposed an approach to configure the two production machines,
leveraging lpsetup, and have not yet heard back from IS.
 * We have made progress on the refactoring of lpsetup.  In particular,
we committed initial versions of the inithost and initlxc commands.
 * Led by bac, and thanks to help from James Westby and Diogo Matsubara,
we now have tarmac running tests and managing commits to our main
lpsetup branch.  bac also set up tarmac to run tests and manage commits
for our zope.testing fork.

Also, the announcement is a week late, but thanks to gmb, Launchpad has
a screencast for fixing bugs.  Take a look!

Goals for next week

We've added this section to the Friday call and minutes in order to
eliminate the biweekly status emails I was producing.  The "goals"
section in the biweekly status emails was the only non-duplicated
section that we deemed important.

Next week, frankban will be unavailable all week, and bac, benji and
gary_poster will be unavailable on Wednesday.

 * Continue running parallel tests on the EC2 32 core machine and
aggregating results.
 * Make another attempt on at least one of 974617/1011847 and 1002820.
 * Land initial and usable versions of the remaining lpsetup commands: get,
update, and inittests.
 * Package and use a refactored version of lpsetup for our parallel
testing setup to validate that it still works there.
 * Agree with IS on an approach to configuring the two new production
machines.


Action Items

ACTION: gary_poster will make a kanban card to create a first cut at
smoke tests for all of our subcommands.
COMPLETED. gary_poster made the card, and benji and frankban subdivided
and completed it.

ACTION: gary_poster will make a kanban card to make the tarmac
gatekeeper enforce our test coverage expectations (the code review of
the enforcement will also include the discussion as to what we are
enforcing as a first cut).
COMPLETED. gary_poster made the card and bac completed it (see
"Successes" for a more complete discussion).  This ended up not
involving a code review, and so the test coverage conversation did not
happen.

ACTION: gary_poster will create a slack card for investigating
integration test approaches.  If someone works on this in slack time and
shows us a way forward, we'll open this conversation again.  Until that
point, or until we successfully release lpsetup for developer usage,
they are postponed and effectively discarded.
COMPLETED.  gary_poster made a card.  frankban and gary_poster also
discussed how to do this because of some manual testing on ec2 that they
both had done.  They proposed a way forward.  See "Pain" for discussion.

ACTION: bac will research how to get and integrate tarmac resources (a
testing machine) for a project.  He will first consult with matsubara
about this.  The results will be new/improved documentation on how to
get tarmac deployed for a project, and/or information on what it would
take to make this easier.
COMPLETED. bac documented how to do this with James Westby's puppet
scripts and Canonistack here:
https://dev.launchpad.net/yellow/TarmacOnCanonistack.  Diogo Matsubara's
current approach was very nice but required us to have access to the QA
lab.  We requested access from IS and have not heard back.  James'
approach let us move forward quickly.  He is reportedly working on a
Juju solution to replace the Puppet scripts, and we'll be interested in
that when it is ready.  This action item came from concerns about
Launchpad's zope.testing fork.  bac also integrated tarmac with our
zope.testing fork as part of this effort, to gate landing code with
running tests.


New tricks

 * gary_poster: how can code determine if it is being run within an LXC
container?

gary_poster asked Serge Hallyn if there were a reliable way for code to
determine if it is being run within an LXC container.   Serge said yes,
and gave these steps (note that this may be Ubuntu-specific;
Ubuntu-specific is good enough for us right now).

 * If /bin/running-in-container is present (Precise and above, always),
run it and check for a 0 return value.
 * Otherwise, if lxc-is-container is not present, assume lxcguest is not
installed and you're not in a container (or are in a trimmed container).
 * Otherwise, run lxc-is-container: if it returns 0, you're in a
container; if it returns 1, you're not.

gary_poster translated that into this Python code, which seems to work
everywhere he's tried it so far.

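import errno
import subprocess


def running_in_container():
    """Return True if this code is running inside an LXC container."""
    # 'running-in-container' is Precise and greater; 'lxc-is-container' is
    # Lucid.  These are provided by the lxcguest package.
    for command in ('running-in-container', 'lxc-is-container'):
        try:
            # An exit status of 0 means "in a container".
            return subprocess.call([command]) == 0
        except OSError as err:
            # ENOENT means "No such file or directory": the command is
            # not installed, so try the next one.
            if err.errno != errno.ENOENT:
                raise
    # Neither command exists: assume we are not in a container.
    return False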

Someone else on the #ubuntu-server freenode channel also recommended
https://github.com/kwilczynski/facter-facts/blob/master/lxc.rb for
ideas, which gary_poster passes on without having given much more than a
glance so far.
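
For unit-testing code that branches on the answer, it can help to make
the external commands injectable.  The following is only a sketch of
that idea, not something from the meeting: it assumes Python 3.3 or
later (for shutil.which), and the "which" and "call" parameters are
hypothetical hooks added for illustration.

```python
import shutil
import subprocess


def running_in_container(which=shutil.which, call=subprocess.call):
    """Return True if one of the lxcguest helper commands reports a container.

    `which` and `call` are hypothetical injection points so that tests
    can stand in for the real commands.
    """
    # Same order as the steps above: the Precise-and-later command
    # first, then the Lucid one.
    for command in ('running-in-container', 'lxc-is-container'):
        path = which(command)
        if path is None:
            # Not installed; try the next command.
            continue
        # Exit status 0 means "in a container".
        return call([path]) == 0
    # Neither command is installed: assume we are not in a container.
    return False
```

A test can then pass, say, which=lambda name: None to simulate a
machine without lxcguest installed.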

 * benji: use -s flag to combine nosetests and pdb

benji discovered that nosetests eats stdout by default, which is not
terribly helpful if you want to use pdb.  Use nosetests' -s flag for
great justice.
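
As a small illustration (the module and function names here are made
up), a breakpoint in a test like this one is only usable when nose is
told not to capture stdout:

```python
# test_example.py: hypothetical module name, for illustration only.


def add(a, b):
    return a + b


def test_add():
    result = add(2, 3)
    # Uncomment the next line to drop into the debugger here, then run
    #     nosetests -s test_example.py
    # so that the (Pdb) prompt is not swallowed by stdout capture.
    # import pdb; pdb.set_trace()
    assert result == 5
```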

 * bac: nosetests will not discover test modules if the execute bit is set

See title.  bac found this surprising.  nosetests --help gives the
workaround and explanation.
  --exe                 Look for tests in python modules that are executable.
                        Normal behavior is to exclude executable modules,
                        since they may not be import-safe [NOSE_INCLUDE_EXE]

 * gmb: beware: lpsetup probably shouldn't overwrite your SSH keys with
nonsense

gmb pointed out (via a pre-recorded note) that it is possible, and
arguably too easy, to make lpsetup overwrite your SSH keys with
nonsense.  Admittedly, what he did was a mistake, but still.  We
already have a card for making an interactive installation story for
lpsetup, but this is worth its own bug
(https://bugs.launchpad.net/lpsetup/+bug/1018823) and kanban card.

 * bac: beware: multiple running gpg-agents are bad. Years-old login
files (.profile, .bashrc) are bad.

It's pretty common to be warned that you should not have multiple
running gpg-agents.  However, you might not realize that, as you accrete
login files across distribution upgrades over the years, these may start
multiple gpg-agents without your noticing.  bac didn't.  Beware!
I suppose the lesson to be learned is that you should carefully review
your login files after each distribution upgrade?
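
One way to make that review less tedious is to list every line in your
login files that mentions gpg-agent.  This is only a sketch, assuming
Python 3; the function name is hypothetical, and which files to pass
in is up to you.

```python
import os


def gpg_agent_lines(paths):
    """Return (path, line_number, text) for each line mentioning gpg-agent."""
    hits = []
    for path in paths:
        path = os.path.expanduser(path)
        try:
            with open(path, errors='replace') as handle:
                for number, line in enumerate(handle, 1):
                    if 'gpg-agent' in line:
                        hits.append((path, number, line.rstrip()))
        except OSError:
            # Missing or unreadable file: nothing to report for it.
            continue
    return hits
```

For example, gpg_agent_lines(['~/.profile', '~/.bashrc']) returns the
lines to inspect; more than one line that starts an agent is a hint
that something has accumulated.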


Successes

 * bac: a fast-to-deploy, repeatable tarmac story

In last week's meeting, we identified some costly mistakes: we
had committed failing tests to two projects, lpsetup and
Launchpad's zope.testing fork.  We wanted to make that impossible via
automation. tarmac is a merge manager for Launchpad that can run tests
before it merges, and it has widespread use at Canonical.  We wanted to
use it for this automation.

bac took on this task and made excellent progress: tarmac now gates both
projects, automatically merging from approved merge proposals unless the
merged code fails a test run.
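
The gating idea itself is simple enough to sketch.  To be clear, this
is not tarmac's code or configuration, just an illustration of "merge
only if the tests pass"; every name in it is hypothetical.

```python
import subprocess
import sys

# Hypothetical stand-in for a real test command: always exits with 0.
ALWAYS_GREEN = [sys.executable, '-c', 'raise SystemExit(0)']


def land_if_tests_pass(test_command, merge):
    """Run `test_command`; call `merge()` only if it exits with status 0."""
    if subprocess.call(test_command) != 0:
        # Tests failed: leave the target branch untouched.
        return False
    merge()
    return True
```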

A primary source of that success was incrementalism--making incremental
steps towards the goal, in order to bring value as quickly as possible.
 bac brought value quickly by choosing a solution that could be
immediately available (Canonical's internal OpenStack cloud resources),
rather than the alternative, which requires IS to get around to giving
us access to new resources.  The solution also does not use Juju, which
we would have preferred; but waiting on the Juju charm to be written
would not have brought value as quickly.  We should be able to migrate
to a Juju charm when it is available, but meanwhile, we have something
working and bringing value now.

Another important source was communication and company sharing.  We
published our meeting minutes last week, communicating our needs and
plans.  James Westby read them, and offered to share his solution.
James and bac coordinated, and James' solution was the quick-to-deploy
one that we have now. That's a big validation for us of the effort we
are making to share these minutes!  It's also a big cause for a thank
you to James.  Thanks!

bac took the communication idea two steps further.

 * First, if you are a Canonical employee, you might be interested in
the documentation bac wrote for using James' solution.  It is here:
https://dev.launchpad.net/yellow/TarmacOnCanonistack.
 * Second, bac had some trouble configuring tarmac, once he had deployed
it.  He changed tarmac's documentation, and just a bit of the code, to
hopefully make things easier for the next person to come along.  His
merge proposal into tarmac is here:
https://code.launchpad.net/~bac/tarmac/make_treedir/+merge/112840.


Pain

 * benji: decoy doctests

lpsetup has some docstrings that have examples in them.  The examples
look suspiciously like doctests, and benji thought they were.  This
caused him some confusion, because the examples were a decoy: they looked
like doctests, but they were not hooked up to actually run in the test
suite.  The examples are actually rewritten as unit tests in the normal
test suite.

Could we either remove the docstring tests or hook them up?

gary_poster: which should it be, removal or test suite integration?  If
the examples in a docstring are good for documentation, we should keep
them and make them run in the test suite.  Even if the examples are only
moderately valuable as examples, they can also effectively provide a
warning system for when the docstring's prose needs to be reviewed.

[Editor: We didn't talk about it much, and didn't come to a strong
resolution, but we are generally preferring removal at this time as a
matter of practice.]
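
For reference, the "hook them up" option need not be much code.  The
sketch below uses only the standard doctest and unittest modules; the
double function and the helper name are hypothetical and are not taken
from lpsetup.

```python
import doctest
import unittest


def double(n):
    """Return n doubled (a hypothetical function with a docstring example).

    >>> double(2)
    4
    """
    return n * 2


def check_docstring_examples(obj):
    """Run the examples in obj's docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(obj, globs={obj.__name__: obj}):
        runner.run(test)
    return runner.summarize(verbose=False)


class DocstringExamplesTest(unittest.TestCase):
    """Fails if a docstring example stops matching the code."""

    def test_double_examples(self):
        results = check_docstring_examples(double)
        self.assertEqual(0, results.failed)
        self.assertTrue(results.attempted > 0)
```

For a whole module, doctest.DocTestSuite is the more usual way to get
the same effect inside a unittest-based suite.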

 * benji: chafing on frameworks

lpsetup is a fairly small package, but it also works as its own small
framework.  In order to implement the subcommand pattern (in which you
can run "[COMMAND] [SUBCOMMAND] [options]", like "lpsetup inithost" or
"lpsetup initlxc" or "lpsetup get"), the main "lpsetup" command calls
the subcommands (e.g., "inithost," "initlxc," etc.).  Therefore, when
you write a subcommand, you are experiencing inversion of control, which
is a primary characteristic of a framework.  Moreover, the subcommands
are generally created by subclassing and extending a subcommand base
class, which is another pattern typical of a framework.
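
To make the pattern concrete, here is a minimal sketch of a subcommand
framework of this general shape.  It is not lpsetup's actual code: the
class names, the --user option, and the return values are all invented
for illustration.

```python
import argparse


class BaseSubCommand(object):
    """Hypothetical base class: subclasses fill in the hooks below."""

    name = None
    help = ''

    def add_arguments(self, parser):
        """Hook: declare this subcommand's options."""

    def handle(self, namespace):
        """Hook: do the work.  Called by main(), not by the subclass."""
        raise NotImplementedError


class InitHost(BaseSubCommand):
    name = 'inithost'
    help = 'Prepare the host machine (illustrative stub).'

    def add_arguments(self, parser):
        parser.add_argument('--user', default='nobody')

    def handle(self, namespace):
        return 'inithost for %s' % namespace.user


def main(argv, commands=(InitHost,)):
    """The "framework" part: parse argv, then call into the subcommand."""
    parser = argparse.ArgumentParser(prog='lpsetup-sketch')
    subparsers = parser.add_subparsers(dest='subcommand')
    subparsers.required = True
    for command_class in commands:
        command = command_class()
        subparser = subparsers.add_parser(command.name, help=command.help)
        command.add_arguments(subparser)
        # Remember which handler goes with which subcommand.
        subparser.set_defaults(handler=command.handle)
    namespace = parser.parse_args(argv)
    # Inversion of control: main() decides when handle() runs.
    return namespace.handler(namespace)
```

The subclass never calls main(); main() calls the subclass.  That is
the inversion of control being described, and the subclassing of
BaseSubCommand is the second framework trait.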

benji has been burned by frameworks, and prefers libraries.  For
lpsetup, the framework is small and malleable enough that the annoyances
encountered have only been minor, but in the future he would prefer to
avoid inversion of control entirely, unless it is truly called for.  He
gives examples of reasonable inversion of control as select loop code
like Twisted, UI toolkits like GTK, and URL dispatch like Ruby on Rails
("RoR").

frankban: isn't the lpsetup approach really a similar pattern to RoR URL
dispatch?
benji: maybe.  I'm not that worried about lpsetup.  In fact...

benji: ...gary_poster asked me to talk about this, after I mentioned the
thought to him. Why, gary_poster?
gary_poster: our squad has had pain in the past with developers
rewriting other developers' code.  A concrete example is our lp2kanban
project (https://launchpad.net/lp2kanban), which pushes Launchpad bug
state to LeanKitKanban. One developer switched us from another
developer's functional approach to an object oriented approach.  If we
can roughly agree on design goals initially, that should reduce rework,
and maybe reduce friction.

gary_poster: Another point is that this appears to be a particular
problem for slack projects done by an individual that become
team-maintained projects--at least, we have two data points in this
direction, lpsetup and lp2kanban.  We have already realized that slack
projects need to be analyzed for their future maintenance expectations
when they are first proposed, and this is further confirmation of that.

benji: how can we agree on a design productively?
bac/benji: we want to encourage autonomy and avoid design-by-committee.
gary_poster: I think our prototype process can address this.  Our
(incredibly simple) checklist about this says that a person or two
should prototype, and then we all come together to discuss and agree on
what the rewrite should look like, and then we actually write the code
with TDD.
bac/benji: If a developer has an issue with the design after we've
discussed and agreed, our default stance should be "When in Rome...":
follow the existing design.  A corollary for us might be that "if you
want to rebuild Rome, ask the citizenry first": we should only rewrite
if we have built consensus.

 * gary_poster: updating lpsetup's system workarounds over time

One of the goals that Robert gave us for the lpsetup project was that we
would be able to run it again in the future and have it update a system
to remove workarounds that were no longer necessary and add newly
discovered workarounds.  Our preexisting code did not attempt this, and
we have not tried to code this yet.  gary_poster made a strawman
proposal about how to do this in code.  What do we think?

benji: this sounds a lot like Debian packages.  It's a hard problem, and
packaging systems have been working on it for a long time.  Maybe we
should have a Debian package that manages our workarounds for an LXC
host, and one that manages our workarounds for an LXC container?  We all
think benji might be rather clever.  ACTION: gary_poster will try to
arrange a time to discuss this with Robert.

 * gary_poster/frankban: integration tests for lpsetup, take 2

Last week we talked about how we might make integration tests for
lpsetup, and resolved to create a slack card to investigate.  In the
course of doing manual integration tests, gary_poster gathered some
information that might help automated tests.  frankban had already done
similar work. gary_poster and frankban discussed it.  gary_poster
recorded notes from the discussion and made a simple proposal for a way
forward (https://lists.launchpad.net/yellow/msg00971.html).  Any comments?

No comments.  ACTION: gary_poster will make a kanban card for developing
an integration test suite that works in the way described for the first
(manual run) step.