launchpad-dev team mailing list archive
Message #01662
Re: The second Build Engineer's report!
Gavin Panella wrote:
> I finished being build engineer last week. Here's a summary of the
> things I did:
>
> * Fixed bug #422433 (Race condition when running two ec2test instances
> very close together). This needed to be fixed before working on
> making the test suite run in parallel across several machines.
>
> * Investigated a couple of bugs: #419421 (Buildbot: over time memory
> usage of the buildbot master process gets unreasonable) and #
>
> * Got the jscheck builder running more frequently, and, after some
> cajoling, got it to work. Michael Hudson did the ground work for
> this. Fixing this problem taught me a lot about how buildbot works,
> how to configure it, and meant I got to look at a *lot* of source
> code.
Yay, another victim!
> * With the help of the LOSAs, got another of mwhudson's lpbuildbot
> branches, use-update-sourcecode, merged and rolled out. This removed
> quite a lot of code from lpbuildbot and replaced it with a single
> call to utilities/update-sourcecode.
>
> * Landed an lpbuildbot branch, avoid-deadlock, to fix a potential
> problem in kill-test-pids where it could hang indefinitely. I can't
> tell if this has ever affected us, but it was worth a small fix to
> prevent it.
>
> * Prepared an lpbuildbot branch to fix bug #455737 (PYTHONPATH should
> not be set when calling test_on_merge). This has been reviewed but
> not merged.
>
> * Prepared a possible fix for bug #419408 (Buildbot: over time,
> buildbot creates zombie processes) and bug #419408 (Buildbot: over
> time, buildbot creates zombie processes). I think these two are
> related; see comment 3 in bug 419408 for an explanation of the
> possible culprit.
Um, you think #419408 and #419408 are related? :) I think your
explanation on the bug makes sense (just commented).
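For anyone following along who hasn't met zombie processes before: a child that has exited stays in the process table until its parent waits on it, so a long-running parent that spawns children without ever collecting their exit statuses accumulates zombies over time. The sketch below only illustrates that general mechanism. It is not the fix proposed on the bug, and `spawn_child` and `reap` are hypothetical names, not anything from buildbot or lpbuildbot.

```python
import subprocess
import sys
import time


def spawn_child():
    """Start a short-lived child process and return its Popen handle."""
    # Hypothetical stand-in for whatever a long-running parent launches.
    return subprocess.Popen([sys.executable, "-c", "pass"])


def reap(children, timeout=5.0):
    """Wait on every child so the kernel can release its process-table entry.

    Returns a dict mapping each child's pid to its exit status.
    """
    statuses = {}
    for child in children:
        # wait() collects the exit status; without it an exited child
        # lingers as a zombie on POSIX systems.
        statuses[child.pid] = child.wait(timeout=timeout)
    return statuses


if __name__ == "__main__":
    children = [spawn_child() for _ in range(3)]
    # By now the children have most likely exited; until they are waited
    # on they remain zombies on POSIX.
    time.sleep(0.5)
    print(reap(children))
```

The timeout on `wait()` is there so that the reaping step itself cannot hang forever if a child never exits.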
> There are actually two branches related to this, the fix itself, and
> a port to staging which rolls in the fix and other changes to the
> production configs. Neither has been merged, but the fix has been
> reviewed.
>
> * Investigated bug 433657 (tests regularly fail on buildbot with "no
> space left on device"). Landed Launchpad branch log-statement-none
> to disable PostgreSQL statement logging (which was set to 'all') to
> see if that might help... but I haven't kept track of failures, so I
> don't actually know. It should be possible to go back through the
> build logs and figure it out.
>
> I also documented how to put statement logging back for those who
> want it: https://dev.launchpad.net/Debugging
>
> RT #36179 has been filed to request disk space monitoring on the
> slaves. It would be especially useful to get something like the disk
> space usage report that baobab does when a disk fills up.
>
> * Branch ec2-buildout moves lib/devscripts to a separate place in the
> tree, so that it's another develop egg. The biggest driver for doing
> this was so that it could run with a different Python version. As of
> next Monday that will cease to be an issue, but I think it's still
> useful to treat it as a separate project. There's no need to
> separate it from the Launchpad tree right now, but doing so would be
> quite easy.
>
> This branch is unfortunately not quite finished; hooking in the
> tests to run was proving a hassle, but I think there's a way around
> that (using subunit, yay). Just got to do it :-/
>
> I'm CHR next week so maybe I'll help the community by finishing this
> ;)
>
> * My pet project was trying to get the test suite to split itself up
> and run on several machines in parallel, to reduce run time.
>
> I didn't make much tangible progress on this until the last couple
> of weeks of my stint - only a bus load of reading code and docs -
> but, with a lot of help from jml including a 2-day sprint in London,
> something good has come out of it. There's a branch in review -
> lp:~allenap/launchpad/ec2-parry - that both jml and I worked on, and
> jml has an alternative approach at lp:~jml/launchpad/dirty-parry.
Ah yes, this is lurking like a menacing thing in my inbox currently... I
will get to it, I promise!
> See the cover letter in the ec2-parry merge proposal for an idea of
> how it works. There are two outstanding issues to resolve before
> it'll be generally useful: security around the RPC mechanism needs
> tightening up, and there's a problem where workers are not running
> all the tests. I'll be dogfooding this myself to try and figure
> these out, but if there are any other masochists out there maybe we
> can squash these issues quicker than I can on my own.
>
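As a rough illustration of the "workers are not running all the tests" problem mentioned above, here is a minimal sketch of splitting a list of test IDs across workers and then cross-checking that every test was actually run. It is not how ec2-parry works (the cover letter in the merge proposal describes that); `partition_tests`, `missing_tests`, and the test IDs are all made up for the example.

```python
def partition_tests(test_ids, worker_count):
    """Split test IDs into worker_count round-robin shards.

    Sorting first makes the split deterministic, so every machine
    computes the same shards from the same list of IDs.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    ordered = sorted(test_ids)
    return [ordered[i::worker_count] for i in range(worker_count)]


def missing_tests(test_ids, shards_run):
    """Return the tests that no worker reported running."""
    seen = set()
    for shard in shards_run:
        seen.update(shard)
    return sorted(set(test_ids) - seen)


if __name__ == "__main__":
    # Hypothetical test IDs, for illustration only.
    tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]
    shards = partition_tests(tests, 2)
    print(shards)
    # Pretend the second worker never reported back.
    print(missing_tests(tests, shards[:1]))
```

Comparing the full list against what the workers report is a cheap way to notice dropped tests, whatever mechanism is used to hand the shards out.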
> Comments on being Build Engineer:
>
> * Getting started was daunting. Suddenly having to actually know about
> PQM, buildbot, AWS/EC2, unittest, zope.testing, and so on, was a
> learning cliff-face, but a few things got me through. Figuring out
> the jscheck issue helped me understand buildbot and be, frankly,
> less scared of it. But most of all, having mwhudson and jml to talk
> to was probably the most reassuring thing.
>
> * For a lot of the BE stint I was fighting little fires (with my water
> pistol of limited knowledge). I got an idea of what I imagine the
> LOSAs feel like every day :-/ (Not the water pistol bit; the
> fighting fires bit).
>
> I felt like I spent a lot of my time task-switching, and the lack of
> tangible output was a bit of a downer. Coming after mwhudson, who
> did a lot of build-related goodness, I put myself under a lot of
> pressure to make a mark.
Maybe I'm just used to the feeling of fighting lots of little fires :-)
(I certainly seem to spend enough time doing it when I'm not BE).
> I guess it's worth reminding future Build
> Engineers that it's also about learning. The BuildEngineer wiki page
> even states as an advantage that "Knowledge about the build system
> is spread around the team."
Yes, I think this is definitely part of the goal, so if you know more
about the system now that's a success, even if you'd achieved nothing
(not that it sounds like this was the case).
> * I definitely think the BE role is worth it. It's a break from the
> routine. I've learnt a ton that I can bring back to my normal role
> in Bugs. I think I've made improvements to the build side of
> Launchpad (though I wish I could have made more).
>
> * Michael said in his report that "It's hard to get things done on the
> infrastructure in week 4!". It was difficult to get things done in
> the last *three* weeks of the 3.1.10 cycle because there was new
> hardware, U1, Karmic, and a Launchpad release.
Oof. I don't envy you that :)
> Especially when it
> comes to buildbot and PQM, much of the BE's role is LOSA intensive,
> and, as I've already fed back to Gary, I didn't feel like I had the
> right to push for attention from them for BE fixes (excepting
> show-stoppers).
As spm said, I think it's always worth asking. Though for most of the
last cycle the answer would have been "no".
> * Michael also said "not being able to land branches in week 4 is a
> pain, even more than normal", and "... the build engineer's work is
> sort of sideways to the main thrust of launchpad development".
>
> It might be beneficial if the BE role was 2 weeks out of sync with
> the normal development cycle.
Hm, that's an attractive idea. The downside that I see is that it might
torpedo two cycles of the BE's "normal" development rather than just one.
> * I did do some Bugs work during my stint. It just had to be done, but
> it was probably <5% of my time.
Yeah, I think this is inevitable.
> * I wish I was as concise as mwhudson.
I don't think my terseness is a uniformly good thing :-)
> Have a good stint stub!
Indeed! Gavin and I are here to share your pain :)
Cheers,
mwh