launchpad-dev team mailing list archive

Yellow squad retrospective minutes: June 15

 

Pretty version

http://codesinger.blogspot.com/2012/06/yellow-squad-weekly-retrospective_18.html

Headlines

Attendance
Project plan
Action items
New tricks
  gmb: beware: zope.testing will occasionally swallow ImportErrors
  benji: beware: be careful when powering off lxc containers
  benji: beware: if you manually poweroff ec2 you might break your next
juju run
  frankban: beware: change working directory before imports in your
Python code
Successes
  gary_poster: look at how well our test runs are doing!
Pain
  benji: What can we learn from bug 996729: Global state hates you, and
stdout is global state
  bac: What can we learn from bug 682772: Don't go on vacation with an
active card
  gary_poster: Increased timeouts are not enough
  gary_poster: The discrepancy between worker completion times is really
costing us

Attendance

Attending: the gang's all here (bac benji frankban gary_poster gmb)
(These are freenode.net nicks)


Project plan

 * We have fixed all the critical bugs we know about for the parallel
testing project, except for recurring problems with bugs 994752, 1011793,
and 1002820.  See the discussion in "Pain" below.
 * Our lxc workers are completing at very different times: typically the
first and last finishers are more than 10 minutes apart.  If we can fix
this, perhaps we could save another five minutes (of 38-ish on our 32
core testing EC2 machine).  See the discussion in "Pain" below.
 * We are still waiting on the two new 24 core machines in the data
center to actually run the virtualized tests in production.  We continue
to use a 32 core EC2 machine and Juju for our tests now.
 * We are moving on to our stretch goals.  lpsetup gets first billing,
and hopefully we'll also have time for a Juju Launchpad developer charm
and a revised "ec2 test" command.


Action Items

 * Gary was supposed to add a number of checklists for us to follow,
covering working across teams, running the morning meeting, and working
as the observer in pair programming.  This is done: all of them have
been started and are listed with our other checklists at
https://dev.launchpad.net/yellow.


New tricks

 * gmb: Beware: zope.testing will occasionally swallow ImportErrors

If zope.testing reports that it cannot find a test, try importing the
test module in the Python interpreter: the real problem may be an
ImportError that zope.testing swallowed.  gmb suggests that the new
tagline for zope.testing is "Likes to mess with you!"
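
For example, importing the module directly makes the swallowed error
visible (the module name here is purely illustrative):

    # Substitute the test module zope.testing claims not to find; a
    # latent ImportError will be raised here instead of being swallowed.
    import importlib
    importlib.import_module("lp.services.webapp.tests.test_example")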


 * benji: Beware: be careful when powering off lxc containers

You need to be very careful when calling poweroff in your lxc container,
because the shell you run it in might not be in your lxc container but
on your host!  gary_poster: I've thought it might be nice to have an
lxc-poweroff installed in the container for just this reason.  It would
not exist on the host, and in the container it would simply be an alias
for (or link to) poweroff.
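
A minimal sketch of that idea, assuming LXC's habit of exporting a
container= marker in init's environment (the script and its check are
illustrative, not something we ship today):

    #!/usr/bin/env python
    # Hypothetical lxc-poweroff: refuse to act unless we appear to be
    # inside a container, so running it on the host is harmless.
    import subprocess
    import sys

    def in_container():
        # LXC normally exports container=lxc in PID 1's environment;
        # treat this purely as a heuristic.
        try:
            with open("/proc/1/environ", "rb") as f:
                return b"container=" in f.read()
        except IOError:
            return False

    if in_container():
        subprocess.check_call(["poweroff"])
    else:
        sys.exit("Refusing to power off: this looks like the host.")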


 * benji: Beware: if you poweroff ec2 you might break your next juju run

If you poweroff your ec2 host (see the previous warning), juju may not
work properly the next time you start a new environment: one of your
machines may hang in "pending" forever.  It's far from obvious, but the
hang is caused by juju trying to reuse a security group name that is
still attached to your old, non-terminated ec2 instance.  You need to
terminate that old instance (e.g. in the EC2 AWS console) so that the
next juju startup won't be stuck in "pending" forever.
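
If you would rather script that cleanup than click through the console,
a rough sketch using boto (the region and instance id are placeholders;
this is not part of our tooling) might look like:

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    # Terminate the stale instance so juju can reuse its security group
    # name on the next bootstrap.
    conn.terminate_instances(instance_ids=["i-0123abcd"])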


 * frankban: Beware: change working directory before imports

Launchpad's setup code did a lot of imports, including docutils, and
only then changed the working directory.  That broke things, because
docutils captures global state at import time based on the current
working directory.  frankban fixed this by changing the working
directory before doing the imports.  benji: two corollaries are "global
state hates you" and "don't do anything at module import time."

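In outline, the fix frankban describes looks something like this (the
path is illustrative):

    import os

    # Change the working directory first...
    os.chdir("/path/to/the/launchpad/tree")

    # ...and only then import modules that capture global state at
    # import time; docutils, for example, bases some of its state on the
    # current working directory when it is imported.
    import docutils.core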

Successes

 * gary_poster: Look at how well our test runs are doing!

We had a recent record-breaking 13 passing runs in a row, which got us
back above the 90% success rate; and we only have a few bugs left.
Let's take a second and be happy about it!  benji: learning about lxc
and ec2 and ebs has been a lot of fun, and it has been nice to be
working on and helping with code and tools that are essential to the
company.  gmb: agrees.


Pain

 * benji: What can we learn from bug 996729: Global state hates you, and
stdout is global state

[Editor's note: The kanban card for bug 996729 took more than a day to
move out of coding, so it automatically became a topic for the weekly
call, per our checklist.]

This bug was about making sure that the Launchpad testrunner's subunit
generation code is the only code with control over stdout and stderr
when tests are run in subunit mode (--subunit).  This is important
because the testrunner must produce a parsable subunit stream, and
random junk can pollute it and break subunit consumers.  Getting it to
work properly has taken more than a week.  Why?

Global state hates you.  stdout and stderr are global state, and code
everywhere expects to be able to write to them, and manipulate them.
For example, zope.testing uses stderr as a back channel for the child
process (running a test layer) to talk to the parent process
(aggregating the results).  Because running tests involves arbitrary
code, that's very unsafe.  Another example we encountered is forking:
forking copies global state, including things you might not expect, such
as the Python-level buffers of stdout and stderr.  That one was a bit
painful to debug.
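
A tiny demonstration of that forking gotcha, assuming stdout is block
buffered (e.g. redirected to a file or pipe): the word comes out twice,
once per process.

    import os
    import sys

    sys.stdout.write("hello")   # sits in the stdout buffer, unflushed

    if os.fork() == 0:
        # The child inherited a copy of the parent's buffer, so it
        # flushes its own "hello".
        sys.stdout.flush()
        os._exit(0)

    os.wait()
    sys.stdout.flush()          # the parent flushes "hello" again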

Python's "mutable" (sys.stdout) and "pretend I'm immutable"
(sys.__stdout__) variants are no solution either.  We had to try to
assert priority: no one gets to write to stdout except our code...so we
ended up having to mutate sys.__stdout__ too.  Does that mean there
should be an ___stdout___ and ___stderr___?  When does it end?  (We
briefly and jokingly suggested the names "dunder" for double underscore,
"tunder" for triple underscore, and so on.)  As a related point, don't
assume __stdout__ and __stderr__ are the objects that were in stdout and
stderr before you started mucking with them.  Who knows what some crazy
other code has done with those globals?

gmb: If you muck with stdout and stderr, we on the yellow squad will run
around with shotguns being paranoid.

benji/gmb: testr's desire/expectation that subunit will own stdout is
problematic from our perspective: ssh combines stderr and stdout, and
stdout is used for so many things in arbitrary code, and even required
for things like pdb.  We talked about other options but stuck with this
approach because Robert preferred it, we understood his point, and there
was no guarantee that the other approaches, like giving subunit its own
channel, would be any less problematic within the large scope of our
project.


 * bac: What can we learn from bug 682772: Don't go on vacation with an
active card

[Editor's note: The kanban card for bug 682772 took more than a day to
move out of coding, so it automatically became a topic for the weekly
call, per our checklist.]

Don't leave for vacation when you have an active card, unless you've
explicitly handed it off.


 * gary_poster: Increased timeouts are not enough

We still have LXC startup timeouts that were supposed to be solved by
the fix for bug 994752, which increased timeouts and made some code more
efficient.  On test EC2 machines, we have increased the timeout to as
much as 10 minutes and still, rarely, seen startup timeouts.  Bug
1002820, and possibly bug 1011793, have not been solved by increasing
timeouts.  They continue to occur rarely, but frequently enough to keep
us from achieving our goal statistics.

A simple thing to try is to decrease the number of LXC instances we run
at once, using a hack until bug 957145 is addressed [Editor's note:
which it has been now--thanks Robert!].  When total test time starts
increasing, maybe that's too little parallelization.  When tests get
better passing stats and test time is about the same, maybe that means
we've found some kind of ideal. But this takes a long time to determine,
and it doesn't dig into the underlying problem.  What else can we do to
investigate and solve it?

benji: Bug 1011793 may be solved by returning a missing log directory.
To investigate the rest, try using iostat and vmstat, and pidstat -T for
each lxc.
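
A rough way to collect some of that data alongside a run (the log file
names and the five-second interval are arbitrary choices; per-container
pidstat would be a further refinement):

    import subprocess

    # Start host-wide samplers next to a parallel test run so LXC
    # slowdowns can be correlated with I/O and memory pressure.
    samplers = [
        (["vmstat", "5"], "vmstat.log"),
        (["iostat", "-x", "5"], "iostat.log"),
    ]
    procs = []
    for cmd, logname in samplers:
        log = open(logname, "w")
        procs.append(subprocess.Popen(
            cmd, stdout=log, stderr=subprocess.STDOUT))

    # ... run the parallel test suite here ...

    for proc in procs:
        proc.terminate()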

[Editor's note: since the call, we have done some of this investigation,
and have some interesting results.  We intend to do a bit more data
gathering and then file a bug for our fine LXC engineers to explore; and
we think that bug 1002820 is a separate issue.]


 * gary_poster: The discrepancy between worker completion times is
really costing us

Often the first LXC container in a parallel run finishes its tests 12
or 13 minutes before the last container in the same run.  Even given a
three-minute timeout for LXC starts (see the previous item), that's a
big discrepancy.  Why is it happening?

We agreed it was worth filing a card for, but did not have any immediate
insights.  We briefly lamented that we could not feed tests to workers
dynamically, or just in time; but had to decide what tests each worker
should run at the start of the test run.

[Editor's note: investigating this problem later in the day led to us
filing bug 1013921, which might help significantly and shave off as much
as five minutes from the total test time.]