
launchpad-dev team mailing list archive

Re: Legacy, performance, testing: 6 months of new critical bugs analysed

 

On Sat, Oct 22, 2011 at 11:01 AM, Francis J. Lacoste
<francis.lacoste@xxxxxxxxxxxxx> wrote:
> Hello launchpadders,
>
> As most of you are aware, I've been working on an analysis of our new
> critical bugs for a while now. (Seems like I started this at the end of
> August.) Anyway, I'm done collecting all the data and I have a draft
> analysis.
>
> I'm soliciting review of the collected data as well as the
> analysis and recommendations.
>
> The analysis is in a Google document; you can edit and leave comments in it.
>
> https://docs.google.com/a/canonical.com/document/d/1GNgTwk62WzG9oIN91bTZI4fNwfylYiSdXC56y9i_riQ
>
> The document is only accessible to Canonical employees, but there is a
> published version of the document available at
>
> https://docs.google.com/document/pub?id=1GNgTwk62WzG9oIN91bTZI4fNwfylYiSdXC56y9i_riQ
>
> You won't be able to comment inline there, but feel free to follow
> up on the list.
>
> The actual data (in a spreadsheet) is linked from the analysis document.
>
> I'm attaching a PDF version of the document, in case anyone wants
> to read it offline.
>
> tl;dr
>
> * Most of the new bugs (68%) are actually legacy issues lurking in our
> code base.
> * Performance and spotty test coverage together account for more
> than 50% of our new bugs. We should refocus maintenance on tackling
> performance problems; that's what will give us the most bang for
> the buck (even though it's not cheap).
> * As a team, we should increase our awareness of testing techniques
> and testing coverage. Always do TDD, and maybe investigate ATDD to
> increase the coverage and documentation of the business rules we
> should be supporting.
> * We also need to pay more attention to how code is deployed; it's
> now quite common for scripts to be interrupted, and for new and old
> versions of the code to operate in parallel.

Thanks for putting so much time into this, Francis! I'm *very* glad
we took the time to understand the issue rather than just addressing
the symptoms. Something I'd really like to see is a reduction in the
overall number: there is a real cost to having so many zomg bugs (and
we do have that many bugs: ignore the labels, the issue is real). As
I read the first pivot table, 1/3 of the filed bugs do not get fixed,
and about two thirds of those are legacy bugs.

So what this says is '77% of the *increase* in criticals is due to
long-standing existing defects': it's tech debt that we're *finally*
paying off. If we were closing fewer of the legacy criticals, our
increase would be substantially higher.

So while I'm *totally* behind a renewed focus on performance -
totally totally totally - I wonder if performance and legacy bugs are
similar in that they are both debt we're paying for now: schemas,
code structure (lazy evaluation), incomplete implementations, etc.
Performance bugs perhaps want more schema changes, but equally, other
correctness bugs need schema work too.

The maintenance+support squads together are paying off 14/29 = 48%
of the tech debt listed as 'legacy', and doing that is taking
14/22 = 63% of their combined output. To stay on top of the legacy
critical bug source, then, we need a 100% increase in the legacy fix
rate, and that isn't available from the existing maintenance squads
whether or not we ask them to drop other sources of criticals. Even
if we had no maintenance-added criticals (6 items) and that effort
translated 1:1 into legacy fixes, we'd still be short 9 legacy bug
fixes to keep the legacy component flat.
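
For what it's worth, those percentages and the shortfall can be
reproduced from the figures quoted in this mail; here's a quick
Python sketch (the variable names are mine and reflect my reading of
the spreadsheet numbers):

# Figures as quoted above; the names are mine.
legacy_criticals_filed = 29   # new criticals attributed to legacy issues
legacy_fixed_by_squads = 14   # legacy criticals fixed by maintenance+support
squad_total_output = 22       # all criticals fixed by those squads
maintenance_added = 6         # criticals introduced by maintenance work itself

share_of_legacy_paid = legacy_fixed_by_squads / legacy_criticals_filed
share_of_squad_output = legacy_fixed_by_squads / squad_total_output
print(f"{share_of_legacy_paid:.1%} of the legacy inflow paid off")    # 48.3%
print(f"{share_of_squad_output:.1%} of squad output spent doing it")  # 63.6%

# Even if the 6 maintenance-added criticals vanished and that effort
# went 1:1 into legacy fixes, we would still fall short of the inflow:
shortfall = legacy_criticals_filed - (legacy_fixed_by_squads + maintenance_added)
print(f"still {shortfall} legacy fixes short of keeping it flat")     # 9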

So this says to me that we are really mining things we didn't do
well enough in the past, and each one takes long enough to fix that,
until we hit the bottom of the mine, it's going to be a standing
feature for us.

I agree with the recommendations to spend more effort on the safety
nets of testing; the decreased use of doctests and increased use of
unit tests should reduce maintenance overhead, and avoiding known
problems is generally a positive thing. The SOA initiative will also
help us decouple things as we go, which should help with
maintainability and reaction times.
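
To make the doctest-vs-unit-test point concrete, here is a generic
illustration (not Launchpad code; the helper and its numbers are made
up) of the same check expressed both ways:

import doctest
import unittest

def bug_weight(importance):
    """Toy helper used only for this illustration.

    Doctest style: the expected output is matched textually, which is
    brittle against representation changes and noisy to debug.

    >>> bug_weight("critical")
    3
    """
    return {"critical": 3, "high": 2}.get(importance, 1)

class BugWeightTest(unittest.TestCase):
    # Unit-test style: the same check as an explicit assertion, which
    # survives refactoring better and gives clearer failure output.
    def test_critical_weight(self):
        self.assertEqual(3, bug_weight("critical"))

if __name__ == "__main__":
    doctest.testmod()  # run the doctest embedded in the docstring
    unittest.main()    # run the TestCase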

What troubles me a bit is the unknown size of the legacy mine, and
that the analysis shows feature work added criticals equal to 25% of
the legacy volume. The great news is that all the ones you examined
were fixed. I'd like us to make sure, though, that we don't end up
adding performance debt, which can be particularly hard to fix.

The numbers don't really say we're safe from this: 26% of criticals
coming from changes (feature + maintenance) is a large amount, and
features in particular are not just tweaking things; they are making
large changes, which adds up to a lot of risk. There are two aspects
to the feature rotation that have been worrying me for a while: one
is performance testing of new work (browser performance, data
scaling - the works), the other is that we rotate off right after
users get the feature. I think we should allow something like 10% of
the feature time so that after-release adoption issues can be fixed
from the resources allocated to the feature. One way to do this
would be to say that:
 - After release, feature squads spend 1-2 weeks doing polish and/or
general bugs (in the area, or even just criticals -> high, etc.). At
the end of that time, they stay on the feature, doing the same work,
until all the critical bugs introduced or uncovered by the feature
work are fixed.

For the performance side, we could make performance/scalability
testing a release criterion: we already agree that all pages done
during a feature must have a <1 sec 99th percentile and a 5 second
timeout. Extending this to say that we've tested those pages with
large datasets would be a modest tweak and would likely catch issues.
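
As a sketch of what that could look like: a test that builds a
production-sized dataset and asserts the page stays inside the
budget. The helper names below are made-up stand-ins, not Launchpad's
actual test factory or rendering APIs; only the thresholds come from
the numbers above.

import time
import unittest

# Stand-ins for the real test factory and page-rendering entry points;
# in a real suite these would come from the application's own test
# infrastructure.
def make_many_bugs(count):
    return list(range(count))

def render_bug_listing(bugs, batch_size=75):
    return bugs[:batch_size]

class BugListingScalingTest(unittest.TestCase):
    """Render the listing against a large dataset and check the budget."""

    RENDER_BUDGET = 1.0  # seconds: the <1 sec 99th percentile target above
    HARD_TIMEOUT = 5.0   # seconds: the hard timeout mentioned above

    def test_listing_with_production_sized_data(self):
        bugs = make_many_bugs(count=10_000)  # large, not fixture-sized
        start = time.monotonic()
        render_bug_listing(bugs)
        elapsed = time.monotonic() - start
        # A single sample can't prove a 99th percentile, but it does
        # catch the gross blow-ups that only appear with real volumes.
        self.assertLess(elapsed, self.RENDER_BUDGET)
        self.assertLess(elapsed, self.HARD_TIMEOUT)

if __name__ == "__main__":
    unittest.main()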

I think it's OK for criticals found a few weeks later to be handled
by the maintenance squads, which will include the erstwhile feature
squad that triggered them, but we should account for the majority of
the feature-related criticals in the resourcing of the feature:
scaling issues in particular can be curly and require weeks of work,
something maintenance mode, with its interrupts etc., is not suited
to. And our velocity measurements shouldn't be inflated by not
counting that work as part of the feature :)

-Rob

