launchpad-dev team mailing list archive

Thread
Date
Re: 810290 mail-scope after-action review

To: Martin Pool <mbp@xxxxxxxxxxxxx>
From: John Meinel <john@xxxxxxxxxxxxxxxxx>
Date: Wed, 3 Aug 2011 19:52:13 +0200
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAA9uavCYkvhfWUirNjY1=h4uUtAM+_LDJV5y5TzntJ6KtKi0yg@mail.gmail.com>
The only thing I would add is that when I was working on the Launchpad
up-to-date check, refactoring the code to pull out the online bits for
testing worked pretty well. Eg having a way to have fake results, etc. And
the code came out pretty nice because of it (reduced coupling, more
isolation of when it makes web requests vs when it thinks about state).
I was able to test it end-to-end manually, though. That seems harder to do
with the LP setup. Having a closer-to-production development environment
would certainly help.

John
=:->
On Aug 2, 2011 8:50 PM, "Martin Pool" <mbp@xxxxxxxxxxxxx> wrote:
> Robert suggests in today's Tuesday mail to work on making landings
> less risky. My branch
> <https://code.launchpad.net/~mbp/launchpad/mail-scope/+merge/70095>
> for bug <http://pad.lv/810290> broke on qastaging and needed to be
> rolled back after landing. I am going to post a summary here and I
> invite questions about it or suggestions for how I could do better.
>
> Intent: The intent of the change was to allow feature flags to be
> switched based on headers of incoming mail, so that we could do staged
> rollouts or risk mitigation of later mail-related features. In
> particular, I wanted to have this in place before doing later DKIM
> authentication work so that could be tested for a limited set of
> users, or turned off without a new deployment if it caused trouble.
>
> Summary: I started this three months ago as a part time project, put
> it up for review in
> <https://code.launchpad.net/~mbp/launchpad/mail-scope/+merge/60281>.
> On Monday I did the refactorings suggested by Robert, got (two)
> additional reviews, and then sent it to 'ec2 land'. I put a plan for
> qa onto the bug.
>
> Overnight (my time) this was deployed to qastaging and it turned out
> to cause a failure processing incoming mail on
> <https://bugs.launchpad.net/launchpad/+bug/810290/comments/4>.
>
> The crash is a fairly straightforward "object is not callable" that in
> Python typically means the code is just not reached during testing.
> Since this didn't trip in my local tests or in ec2 (or in buildbot?) I
> assume this path isn't actually reached by the test suite. There are
> unit tests but they must be too localized to check everything's hooked
> together properly.
>
> I did have some concerns as I was doing this change that this code
> wasn't easy to test or very well tested and apparently I should have
> listened to them. This covers both manual/interactive testing and
> unit testing.
>
> Outcome:
>
> * ~10h stall on new deployments to qastaging?
> * ~20m developer/losa work to diagnose the problem and roll it back
> (is that accurate?)
> * no disruption to production
>
> Discussion:
>
> * It's good this was caught on qastaging and did not break production.
>
> * It's unfortunate a deterministic error that could have been caught
> by a static checker (at least if there was a sufficiently good one in
> Python) or a test suite wasn't caught until it was on qas.
>
> * As far as I know this was done in accord with the Launchpad
> procedures (got reviews, went through ec2 etc); so slipping this far
> through the funnel wasn't due to "cheating".
>
> * I think the intent still makes sense and is consistent with
> Launchpad's architectural direction; obviously it's unfortunate if a
> risk-mitigation feature causes breakage. Some of the follow ons from
> here will probably be patches that make mail more testable or tested;
> those need to be individually non risky.
>
> * It was nice that putting a qa plan onto the bug let Gavin qa the
> bug for me and it was good that the qa procedure did catch the latent
> problem before it went to production. I will do this again, probably
> earlier in the pipeline.
>
> * Testing this on qastaging requires lockstep work with a LOSA to
> change the flags and to process incoming mail. There is a bug about
> letting developers change flag rules on *staging; it would be nice if
> there was a sufficiently cheap way to also have mail automatically
> processed when it arrives.
>
> * Some of the factors that make manual testing hard: requires
> tweaking zcml to let lp read from a mailbox; requires a real DKIM
> signature that is not trivial to produce on demand; integration tests
> of authenticated mail or tangled up with the side effects of doing the
> authenticated action.
>
> * Before doing other mail changes I'll try to make it easier to
> interactively and automatically test this.
>
> * ... your suggestions here?
>
> Martin
>
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to : launchpad-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help : https://help.launchpad.net/ListHelp
References

810290 mail-scope after-action review
From: Martin Pool, 2011-08-03