launchpad-dev team mailing list archive

Re: The second Build Engineer's report!

 

Gavin Panella wrote:
> I finished being build engineer last week. Here's a summary of the
> things I did:
> 
> * Fixed bug #422433 (Race condition when running two ec2test instances
>   very close together). This needed to be fixed before working on
>   making the test suite run in parallel across several machines.
> 
> * Investigated a couple of bugs: #419421 (Buildbot: over time memory
>   usage of the buildbot master process gets unreasonable) and #
> 
> * Got the jscheck builder running more frequently, and, after some
>   cajoling, got it to work. Michael Hudson did the ground work for
>   this. Fixing this problem taught me a lot about how buildbot works,
>   how to configure it, and meant I got to look at a *lot* of source
>   code.

Yay, another victim!

> * With the help of the LOSAs, got another of mwhudson's lpbuildbot
>   branches, use-update-sourcecode, merged and rolled out. This removed
>   quite a lot of code from lpbuildbot and replaced it with a single
>   call to utilities/update-sourcecode.
> 
> * Landed a lpbuildbot branch, avoid-deadlock, to fix a potential
>   problem in kill-test-pids where it could hang indefinitely. I can't
>   tell if this has ever affected us, but it was worth a small fix to
>   prevent it.
> 
> * Prepared a lpbuildbot branch to fix bug #455737 (PYTHONPATH should
>   not be set when calling test_on_merge). This has been reviewed but
>   not merged.
> 
> * Prepared a possible fix for bug #419408 (Buildbot: over time,
>   buildbot creates zombie processes) and bug #419408 (Buildbot: over
>   time, buildbot creates zombie processes). I think these two are
>   related; see comment 3 in bug 419408 for an explanation of the
>   possible culprit.

Um, you think #419408 and #419408 are related? :)  I think your
explanation on the bug makes sense (just commented).
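
On the PYTHONPATH item above (bug #455737): the general idea of not
passing PYTHONPATH to a child process can be sketched as below. This is
only an illustration, not the change in the lpbuildbot branch; the
helper names are hypothetical, and how test_on_merge is really invoked
is not shown here.

```python
import os
import subprocess
import sys


def env_without_pythonpath(environ=None):
    """Return a copy of the environment with PYTHONPATH removed."""
    env = dict(os.environ if environ is None else environ)
    # pop() with a default does nothing if the variable is not set.
    env.pop("PYTHONPATH", None)
    return env


def run_clean(argv, environ=None):
    """Run argv in a child process that does not inherit PYTHONPATH.

    Returns the child's exit status.
    """
    return subprocess.call(argv, env=env_without_pythonpath(environ))


def check_child_sees_no_pythonpath():
    """Return True if a child Python started via run_clean has no PYTHONPATH."""
    code = "import os, sys; sys.exit('PYTHONPATH' in os.environ)"
    return run_clean([sys.executable, "-c", code]) == 0
```

Copying the environment rather than editing os.environ in place keeps
the parent process (here, hypothetically, the buildbot slave) unchanged.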

>   There are actually two branches related to this, the fix itself, and
>   a port to staging which rolls in the fix and other changes to the
>   production configs. Neither has been merged, but the fix has been
>   reviewed.
> 
> * Investigated bug 433657 (tests regularly fail on buildbot with "no
>   space left on device"). Landed Launchpad branch log-statement-none
>   to disable PostgreSQL statement logging (which was set to 'all') to
>   see if that might help... but I haven't kept track of failures, so I
>   don't actually know. It should be possible to go back through the
>   build logs and figure it out.
> 
>   I also documented how to put statement logging back for those who
>   want it: https://dev.launchpad.net/Debugging
> 
>   RT #36179 has been filed to request disk space monitoring on the
>   slaves. It would be especially useful to get something like the disk
>   space usage report that baobab does when a disk fills up.
> 
> * Branch ec2-buildout moves lib/devscripts to a separate place in the
>   tree, so that it's another develop egg. The biggest driver for doing
>   this was so that it could run with a different Python version. As of
>   next Monday that will cease to be an issue, but I think it's still
>   useful to treat it as a separate project. There's no need to
>   separate it from the Launchpad tree right now, but doing so would be
>   quite easy.
> 
>   This branch is unfortunately not quite finished; hooking in the
>   tests to run was proving a hassle, but I think there's a way around
>   that (using subunit, yay). Just got to do it :-/
> 
>   I'm CHR next week so maybe I'll help the community by finishing this
>   ;)
> 
> * My pet project was trying to get the test suite to split itself up
>   and run on several machines in parallel, to reduce run time.
> 
>   I didn't make much tangible progress on this until the last couple
>   of weeks of my stint - only a bus load of reading code and docs -
>   but, with a lot of help from jml including a 2-day sprint in London,
>   something good has come out of it. There's a branch in review -
>   lp:~allenap/launchpad/ec2-parry - that both jml and I worked on, and
>   jml has an alternative approach at lp:~jml/launchpad/dirty-parry.

Ah yes, this is lurking like a menacing thing in my inbox currently... I
will get to it, I promise!

>   See the cover letter in the ec2-parry merge proposal for an idea of
>   how it works. There are two outstanding issues to resolve before
>   it'll be generally useful: security around the RPC mechanism needs
>   tightening up, and there's a problem where workers are not running
>   all the tests. I'll be dogfooding this myself to try and figure
>   these out, but if there are any other masochists out there maybe we
>   can squash these issues quicker than I can on my own.
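
To make the "workers are not running all the tests" problem concrete,
here is a minimal sketch of one way to split a list of test ids across
workers and then check that every test was run by someone. It is not
how ec2-parry or dirty-parry work (see the merge proposal for that);
the function names and the hash-based split are assumptions made for
illustration only.

```python
import zlib


def shard_for(test_id, num_workers):
    """Map a test id to a worker index, stable across machines and runs."""
    # crc32 is used instead of hash() because str hashes are randomised
    # per process, so different machines would disagree on the split.
    return zlib.crc32(test_id.encode("utf-8")) % num_workers


def partition(test_ids, num_workers):
    """Split test_ids into num_workers lists, one per worker."""
    shards = [[] for _ in range(num_workers)]
    for test_id in test_ids:
        shards[shard_for(test_id, num_workers)].append(test_id)
    return shards


def missing_tests(all_test_ids, reported_runs):
    """Return the sorted ids that no worker reported running."""
    ran = set()
    for run in reported_runs:
        ran.update(run)
    return sorted(set(all_test_ids) - ran)
```

Each worker would run partition(ids, n)[i] for its own index i, and the
coordinating machine would call missing_tests(ids, results) once the
workers report back. A hash-based split ignores how long each test
takes, so the shards can be badly unbalanced in run time.
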
> 
> Comments on being Build Engineer:
> 
> * Getting started was daunting. Suddenly having to actually know about
>   PQM, buildbot, AWS/EC2, unittest, zope.testing, and so on, was a
>   learning cliff-face, but a few things got me through. Figuring out
>   the jscheck issue helped me understand buildbot and be, frankly,
>   less scared of it. But most of all, having mwhudson and jml to talk
>   to was probably the most reassuring thing.
> 
> * For a lot of the BE stint I was fighting little fires (with my water
>   pistol of limited knowledge). I got an idea of what I imagine the
>   LOSAs feel like every day :-/ (Not the water pistol bit; the
>   fighting fires bit).
> 
>   I felt like I spent a lot of my time task-switching, and the lack of
>   tangible output was a bit of a downer. Coming after mwhudson, who
>   did a lot of build-related goodness, I put myself under a lot of
>   pressure to make a mark.

Maybe I'm just used to the feeling of fighting lots of little fires :-)
(I certainly seem to spend enough time doing it when I'm not BE).

>   I guess it's worth reminding future Build
>   Engineers that it's also about learning. The BuildEngineer wiki page
>   even states as an advantage that "Knowledge about the build system
>   is spread around the team."

Yes, I think this is definitely part of the goal, so if you know more
about the system now that's a success, even if you'd achieved nothing
(not that it sounds like this was the case).

> * I definitely think the BE role is worth it. It's a break from the
>   routine. I've learnt a ton that I can bring back to my normal role
>   in Bugs. I think I've made improvements to the build side of
>   Launchpad (though I wish I could have made more).
> 
> * Michael said in his report that "It's hard to get things done on the
>   infrastructure in week 4!". It was difficult to get things done in
>   the last *three* weeks of the 3.1.10 cycle because there was new
>   hardware, U1, Karmic, and a Launchpad release.

Oof.  I don't envy you that :)

>   Especially when it
>   comes to buildbot and PQM, much of the BE's role is LOSA intensive,
>   and, as I've already fed back to Gary, I didn't feel like I had the
>   right to push for attention from them for BE fixes (excepting
>   show-stoppers).

As spm said, I think it's always worth asking.  Though for most of the
last cycle the answer would have been "no".

> * Michael also said "not being able to land branches in week 4 is a
>   pain, even more than normal", and "... the build engineer's work is
>   sort of sideways to the main thrust of launchpad development".
> 
>   It might be beneficial if the BE role was 2 weeks out of sync with
>   the normal development cycle.

Hm, that's an attractive idea.  The downside that I see is that it might
torpedo two cycles of the BE's "normal" development rather than just one.

> * I did do some Bugs work during my stint. It just had to be done, but
>   it was probably <5% of my time.

Yeah, I think this is inevitable.

> * I wish I was as concise as mwhudson.

I don't think my terseness is a uniformly good thing :-)

> Have a good stint stub!

Indeed! Gavin and I are here to share your pain :)

Cheers,
mwh


