launchpad-dev team mailing list archive
Message #01631
The second Build Engineer's report!
I finished being build engineer last week. Here's a summary of the
things I did:
* Fixed bug #422433 (Race condition when running two ec2test instances
very close together). This needed to be fixed before working on
making the test suite run in parallel across several machines.
* Investigated a couple of bugs: #419421 (Buildbot: over time memory
usage of the buildbot master process gets unreasonable) and #
* Got the jscheck builder running more frequently, and, after some
cajoling, got it to work. Michael Hudson did the ground work for
this. Fixing this problem taught me a lot about how buildbot works,
how to configure it, and meant I got to look at a *lot* of source
code.
* With the help of the LOSAs, got another of mwhudson's lpbuildbot
branches, use-update-sourcecode, merged and rolled out. This removed
quite a lot of code from lpbuildbot and replaced it with a single
call to utilities/update-sourcecode.
* Landed an lpbuildbot branch, avoid-deadlock, to fix a potential
problem in kill-test-pids where it could hang indefinitely. I can't
tell if this has ever affected us, but it was worth a small fix to
prevent it.
* Prepared an lpbuildbot branch to fix bug #455737 (PYTHONPATH should
not be set when calling test_on_merge). This has been reviewed but
not merged.
* Prepared a possible fix for bug #419408 (Buildbot: over time,
buildbot creates zombie processes) and bug #419408 (Buildbot: over
time, buildbot creates zombie processes). I think these two are
related; see comment 3 in bug 419408 for an explanation of the
possible culprit.
There are actually two branches related to this, the fix itself, and
a port to staging which rolls in the fix and other changes to the
production configs. Neither has been merged, but the fix has been
reviewed.
* Investigated bug 433657 (tests regularly fail on buildbot with "no
space left on device"). Landed Launchpad branch log-statement-none
to disable PostgreSQL statement logging (which was set to 'all') to
see if that might help... but I haven't kept track of failures, so I
don't actually know. It should be possible to go back through the
build logs and figure it out.
I also documented how to put statement logging back for those who
want it: https://dev.launchpad.net/Debugging
RT #36179 has been filed to request disk space monitoring on the
slaves. It would be especially useful to get something like the disk
space usage report that baobab does when a disk fills up.
* Branch ec2-buildout moves lib/devscripts to a separate place in the
tree, so that it's another develop egg. The biggest driver for doing
this was so that it could run with a different Python version. As of
next Monday that will cease to be an issue, but I think it's still
useful to treat it as a separate project. There's no need to
separate it from the Launchpad tree right now, but doing so would be
quite easy.
This branch is unfortunately not quite finished; hooking in the
tests to run was proving a hassle, but I think there's a way around
that (using subunit, yay). Just got to do it :-/
I'm CHR next week so maybe I'll help the community by finishing this
;)
* My pet project was trying to get the test suite to split itself up
and run on several machines in parallel, to reduce run time.
I didn't make much tangible progress on this until the last couple
of weeks of my stint - only a bus load of reading code and docs -
but, with a lot of help from jml including a 2-day sprint in London,
something good has come out of it. There's a branch in review -
lp:~allenap/launchpad/ec2-parry - that both jml and I worked on, and
jml has an alternative approach at lp:~jml/launchpad/dirty-parry.
See the cover letter in the ec2-parry merge proposal for an idea of
how it works. There are two outstanding issues to resolve before
it'll be generally useful: security around the RPC mechanism needs
tightening up, and there's a problem where workers are not running
all the tests. I'll be dogfooding this myself to try and figure
these out, but if there are any other masochists out there maybe we
can squash these issues quicker than I can on my own.
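To illustrate the last item: the sketch below is not the ec2-parry code, only a minimal example of the general idea under assumed names. The function names partition_tests and missing_tests and the test ids are hypothetical. It splits a list of test ids into one bucket per worker, then checks whether any test was left unassigned, which is the kind of check that would help track down workers not running all the tests.

```python
def partition_tests(test_ids, worker_count):
    """Split test ids into worker_count disjoint buckets, round-robin.

    Sorting first makes the split deterministic, so every machine that
    computes the partition from the same ids gets the same buckets.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    ordered = sorted(test_ids)
    return [ordered[i::worker_count] for i in range(worker_count)]


def missing_tests(test_ids, buckets):
    """Return the sorted ids that were not assigned to any bucket."""
    assigned = set()
    for bucket in buckets:
        assigned.update(bucket)
    return sorted(set(test_ids) - assigned)


if __name__ == "__main__":
    # Hypothetical test ids, for illustration only.
    ids = ["test_a", "test_b", "test_c", "test_d", "test_e"]
    buckets = partition_tests(ids, 2)
    print(buckets)                      # [['test_a', 'test_c', 'test_e'], ['test_b', 'test_d']]
    print(missing_tests(ids, buckets))  # []
```

Round-robin over a sorted list keeps the buckets close to equal in size, but it ignores how long each test takes; a real split would probably want to balance on recorded run times instead.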
Comments on being Build Engineer:
* Getting started was daunting. Suddenly having to actually know about
PQM, buildbot, AWS/EC2, unittest, zope.testing, and so on, was a
learning cliff-face, but a few things got me through. Figuring out
the jscheck issue helped me understand buildbot and be, frankly,
less scared of it. But most of all, having mwhudson and jml to talk
to was probably the most reassuring thing.
* For a lot of the BE stint I was fighting little fires (with my water
pistol of limited knowledge). I got an idea of what I imagine the
LOSAs feel like every day :-/ (Not the water pistol bit; the
fighting fires bit).
I felt like I spent a lot of my time task-switching, and the lack of
tangible output was a bit of a downer. Coming after mwhudson, who
did a lot of build-related goodness, I put myself under a lot of
pressure to make a mark. I guess it's worth reminding future Build
Engineers that it's also about learning. The BuildEngineer wiki page
even states as an advantage that "Knowledge about the build system
is spread around the team."
* I definitely think the BE role is worth it. It's a break from the
routine. I've learnt a ton that I can bring back to my normal role
in Bugs. I think I've made improvements to the build side of
Launchpad (though I wish I could have made more).
* Michael said in his report that "It's hard to get things done on the
infrastructure in week 4!". It was difficult to get things done in
the last *three* weeks of the 3.1.10 cycle because there was new
hardware, U1, Karmic, and a Launchpad release. Especially when it
comes to buildbot and PQM, much of the BE's role is LOSA intensive,
and, as I've already fed back to Gary, I didn't feel like I had the
right to push for attention from them for BE fixes (excepting
show-stoppers).
* Michael also said "not being able to land branches in week 4 is a
pain, even more than normal", and "... the build engineer's work is
sort of sideways to the main thrust of launchpad development".
It might be beneficial if the BE role was 2 weeks out of sync with
the normal development cycle.
* I did do some Bugs work during my stint. It just had to be done, but
it was probably <5% of my time.
* I wish I was as concise as mwhudson.
Have a good stint, stub!
Gavin.