launchpad-dev team mailing list archive

The second Build Engineer's report!

To: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
From: Gavin Panella <gavin.panella@xxxxxxxxxxxxx>
Date: Fri, 13 Nov 2009 22:11:52 +0000
Organization: Canonical Ltd
I finished being build engineer last week. Here's a summary of the
things I did:

* Fixed bug #422433 (Race condition when running two ec2test instances
  very close together). This needed to be fixed before working on
  making the test suite run in parallel across several machines.

* Investigated a couple of bugs: #419421 (Buildbot: over time memory
  usage of the buildbot master process gets unreasonable) and #

* Got the jscheck builder running more frequently, and, after some
  cajoling, got it to work. Michael Hudson did the ground work for
  this. Fixing this problem taught me a lot about how buildbot works,
  how to configure it, and meant I got to look at a *lot* of source
  code.

* With the help of the LOSAs, got another of mwhudson's lpbuildbot
  branches, use-update-sourcecode, merged and rolled out. This removed
  quite a lot of code from lpbuildbot and replaced it with a single
  call to utilities/update-sourcecode.

* Landed an lpbuildbot branch, avoid-deadlock, to fix a potential
  problem in kill-test-pids where it could hang indefinitely. I can't
  tell if this has ever affected us, but it was worth a small fix to
  prevent it.

* Prepared an lpbuildbot branch to fix bug #455737 (PYTHONPATH should
  not be set when calling test_on_merge). This has been reviewed but
  not merged.

* Prepared a possible fix for bug #419408 (Buildbot: over time,
  buildbot creates zombie processes) and bug #419408 (Buildbot: over
  time, buildbot creates zombie processes). I think these two are
  related; see comment 3 in bug 419408 for an explanation of the
  possible culprit.

  There are actually two branches related to this, the fix itself, and
  a port to staging which rolls in the fix and other changes to the
  production configs. Neither has been merged, but the fix has been
  reviewed.

* Investigated bug 433657 (tests regularly fail on buildbot with "no
  space left on device"). Landed Launchpad branch log-statement-none
  to disable PostgreSQL statement logging (which was set to 'all') to
  see if that might help... but I haven't kept track of failures, so I
  don't actually know. It should be possible to go back through the
  build logs and figure it out.

  I also documented how to put statement logging back for those who
  want it: https://dev.launchpad.net/Debugging

  RT #36179 has been filed to request disk space monitoring on the
  slaves. It would be especially useful to get something like the disk
  space usage report that baobab does when a disk fills up.

* Branch ec2-buildout moves lib/devscripts to a separate place in the
  tree, so that it's another develop egg. The biggest driver for doing
  this was so that it could run with a different Python version. As of
  next Monday that will cease to be an issue, but I think it's still
  useful to treat it as a separate project. There's no need to
  separate it from the Launchpad tree right now, but doing so would be
  quite easy.

  This branch is unfortunately not quite finished; hooking in the
  tests to run was proving a hassle, but I think there's a way around
  that (using subunit, yay). Just got to do it :-/

  I'm CHR next week so maybe I'll help the community by finishing this
  ;)

* My pet project was trying to get the test suite to split itself up
  and run on several machines in parallel, to reduce run time.

  I didn't make much tangible progress on this until the last couple
  of weeks of my stint - only a bus load of reading code and docs -
  but, with a lot of help from jml including a 2-day sprint in London,
  something good has come out of it. There's a branch in review -
  lp:~allenap/launchpad/ec2-parry - that both jml and I worked on, and
  jml has an alternative approach at lp:~jml/launchpad/dirty-parry.

  See the cover letter in the ec2-parry merge proposal for an idea of
  how it works. There are two outstanding issues to resolve before
  it'll be generally useful: security around the RPC mechanism needs
  tightening up, and there's a problem where workers are not running
  all the tests. I'll be dogfooding this myself to try and figure
  these out, but if there are any other masochists out there maybe we
  can squash these issues quicker than I can on my own.
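
To make the "workers are not running all the tests" problem concrete,
here is a minimal sketch of one way to split a list of test IDs across
workers and then check that nothing was dropped or run twice. This is
not how ec2-parry does it; partition_tests and check_coverage are
hypothetical names made up for illustration, and the round-robin split
is just the simplest scheme that came to mind.

```python
def partition_tests(test_ids, worker_count):
    """Split test IDs into worker_count disjoint buckets.

    Hypothetical helper: sorts the IDs so every machine computes the
    same split, then deals them out round-robin.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    ordered = sorted(test_ids)
    return [ordered[i::worker_count] for i in range(worker_count)]


def check_coverage(test_ids, buckets):
    """Return (missing, duplicated) test IDs across all buckets.

    Hypothetical helper: compares what the workers were given (or
    reported running) against the full list of tests.
    """
    counts = {}
    for bucket in buckets:
        for test_id in bucket:
            counts[test_id] = counts.get(test_id, 0) + 1
    missing = sorted(set(test_ids) - set(counts))
    duplicated = sorted(t for t, n in counts.items() if n > 1)
    return missing, duplicated


if __name__ == "__main__":
    # Made-up test IDs, purely for illustration.
    tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]
    buckets = partition_tests(tests, 2)
    missing, duplicated = check_coverage(tests, buckets)
    print(buckets, missing, duplicated)
```

Feeding check_coverage the lists of tests each worker actually
reported running, rather than the lists it was handed, would show
which tests went missing.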

Comments on being Build Engineer:

* Getting started was daunting. Suddenly having to actually know about
  PQM, buildbot, AWS/EC2, unittest, zope.testing, and so on, was a
  learning cliff-face, but a few things got me through. Figuring out
  the jscheck issue helped me understand buildbot and be, frankly,
  less scared of it. But most of all, having mwhudson and jml to talk
  to was probably the most reassuring thing.

* For a lot of the BE stint I was fighting little fires (with my water
  pistol of limited knowledge). I got an idea of what I imagine the
  LOSAs feel like every day :-/ (Not the water pistol bit; the
  fighting fires bit).

  I felt like I spent a lot of my time task-switching, and the lack of
  tangible output was a bit of a downer. Coming after mwhudson, who
  did a lot of build-related goodness, I put myself under a lot of
  pressure to make a mark. I guess it's worth reminding future Build
  Engineers that it's also about learning. The BuildEngineer wiki page
  even states as an advantage that "Knowledge about the build system
  is spread around the team."

* I definitely think the BE role is worth it. It's a break from the
  routine. I've learnt a ton that I can bring back to my normal role
  in Bugs. I think I've made improvements to the build side of
  Launchpad (though I wish I could have made more).

* Michael said in his report that "It's hard to get things done on the
  infrastructure in week 4!". It was difficult to get things done in
  the last *three* weeks of the 3.1.10 cycle because there was new
  hardware, U1, Karmic, and a Launchpad release. Especially when it
  comes to buildbot and PQM, much of the BE's role is LOSA intensive,
  and, as I've already fed back to Gary, I didn't feel like I had the
  right to push for attention from them for BE fixes (excepting
  show-stoppers).

* Michael also said "not being able to land branches in week 4 is a
  pain, even more than normal", and "... the build engineer's work is
  sort of sideways to the main thrust of launchpad development".

  It might be beneficial if the BE role was 2 weeks out of sync with
  the normal development cycle.

* I did do some Bugs work during my stint. It just had to be done, but
  it was probably <5% of my time.

* I wish I was as concise as mwhudson.

Have a good stint, stub!

Gavin.