← Back to team overview

launchpad-dev team mailing list archive

810290 mail-scope after-action review

 

Robert suggests in today's Tuesday mail to work on making landings
less risky.  My branch
<https://code.launchpad.net/~mbp/launchpad/mail-scope/+merge/70095>
for bug <http://pad.lv/810290> broke on qastaging and needed to be
rolled back after landing.  I am going to post a summary here and I
invite questions about it or suggestions for how I could do better.

Intent: The intent of the change was to allow feature flags to be
switched based on headers of incoming mail, so that we could do staged
rollouts or risk mitigation of later mail-related features.  In
particular, I wanted to have this in place before doing later DKIM
authentication work so that could be tested for a limited set of
users, or turned off without a new deployment if it caused trouble.

Summary: I started this three months ago as a part time project, put
it up for review in
<https://code.launchpad.net/~mbp/launchpad/mail-scope/+merge/60281>.
On Monday I did the refactorings suggested by Robert, got (two)
additional reviews, and then sent it to 'ec2 land'.  I put a plan for
qa onto the bug.

Overnight (my time) this was deployed to qastaging and it turned out
to cause a failure processing incoming mail on
<https://bugs.launchpad.net/launchpad/+bug/810290/comments/4>.

The crash is a fairly straightforward "object is not callable" that in
Python typically means the code is just not reached during testing.
Since this didn't trip in my local tests or in ec2 (or in buildbot?) I
assume this path isn't actually reached by the test suite.  There are
unit tests but they must be too localized to check everything's hooked
together properly.

I did have some concerns as I was doing this change that this code
wasn't easy to test or very well tested and apparently I should have
listened to them.  This covers both manual/interactive testing and
unit testing.

Outcome:

 * ~10h stall on new deployments to qastaging?
 * ~20m developer/losa work to diagnose the problem and roll it back
(is that accurate?)
 * no disruption to production

Discussion:

 * It's good this was caught on qastaging and did not break production.

 * It's unfortunate a deterministic error that could have been caught
by a static checker (at least if there was a sufficiently good one in
Python) or a test suite wasn't caught until it was on qas.

 * As far as I know this was done in accord with the Launchpad
procedures (got reviews, went through ec2 etc); so slipping this far
through the funnel wasn't due to "cheating".

 * I think the intent still makes sense and is consistent with
Launchpad's architectural direction; obviously it's unfortunate if a
risk-mitigation feature causes breakage.  Some of the follow ons from
here will probably be patches that make mail more testable or tested;
those need to be individually non risky.

 * It was nice that putting a qa plan onto the bug let Gavin qa the
bug for me and it was good that the qa procedure did catch the latent
problem before it went to production.  I will do this again, probably
earlier in the pipeline.

 * Testing this on qastaging requires lockstep work with a LOSA to
change the flags and to process incoming mail.  There is a bug about
letting developers change flag rules on *staging; it would be nice if
there was a sufficiently cheap way to also have mail automatically
processed when it arrives.

 * Some of the factors that make manual testing hard: requires
tweaking zcml to let lp read from a mailbox; requires a real DKIM
signature that is not trivial to produce on demand; integration tests
of authenticated mail or tangled up with the side effects of doing the
authenticated action.

 * Before doing other mail changes I'll try to make it easier to
interactively and automatically test this.

 * ... your suggestions here?

Martin


Follow ups