Re: Legacy, performance, testing: 6 months of new critical bugs analysed

Hi Robert,

Thanks for your analysis. Could you take some time to add your
recommendations to the document perhaps?

Some more comments below.

On 11-10-23 06:02 PM, Robert Collins wrote:
> Thanks for putting so much time into this Francis! I'm *very* glad we
> took the time to understand the issue rather than addressing the
> symptoms. Something I'd really like to see is a reduction in the
> overall number: I think there is an impact in having so many zomg bugs
> (and we do have that many bugs: ignore the labels, the issue is real).
> As I read the first pivot table, 1/3 of the filed bugs do not get
> fixed, and about 2/3rds of those are legacy bugs.
> 
> So what this says is '77% of the *increase* in criticals is due to
> long-standing existing defects': it's tech debt that we're *finally*
> paying off. If we were closing fewer of the legacy criticals, our
> increase would be substantially higher.
> 
> So while I'm *totally* behind a renewed focus on performance - totally
> totally totally - I wonder if perhaps performance and legacy bugs are
> similar in that they are both debt we're paying for now - schemas,
> code structure (lazy evaluation), incomplete implementations, etc.
> Performance bugs perhaps want more schema changes, but equally other
> correctness bugs need schema work too.

Yes, I agree with this analysis. I suggested focusing on performance for
two reasons:

1) insufficient_scaling is the individual category with the most bugs
falling under it (24%, next one is missing_integration_test at 14%)
2) they are very easy to identify

But I agree that any work spent on difficult areas (badly factored
code, spotty test coverage, etc.) is probably worthwhile, as it will
pre-emptively remove a bunch of bugs meeting our Critical criteria
before they ever have to be fixed.

It's just that performance problems are very easy to spot, and we have
several well-known patterns for addressing them.
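
For example, the pattern we apply most often is making sure the number
of queries issued by a page stays constant as the data grows. A rough
sketch of that kind of check follows; the recorder and rendering
helpers are hypothetical stand-ins, not our actual test infrastructure:

# Sketch of the "constant query count" pattern: the number of SQL
# statements issued to render a page should not grow with the size of
# the dataset.  The recorder and page-rendering helpers below are
# hypothetical stand-ins, not our real test infrastructure.
import unittest


class QueryRecorder:
    """Collects the SQL statements issued inside a `with` block."""

    def __init__(self):
        self.statements = []

    def record(self, statement):
        self.statements.append(statement)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    @property
    def count(self):
        return len(self.statements)


def render_bug_listing(recorder, bug_count):
    """Hypothetical page renderer: well-behaved, it issues a fixed
    number of queries regardless of how many bugs are listed."""
    recorder.record("SELECT * FROM Bug LIMIT %d" % bug_count)
    recorder.record("SELECT * FROM Person WHERE id IN (...)")  # eager load


class TestBugListingQueryCount(unittest.TestCase):

    def test_query_count_is_constant(self):
        # Render once with a small dataset...
        with QueryRecorder() as small:
            render_bug_listing(small, bug_count=10)
        # ...and again with a much larger one.
        with QueryRecorder() as large:
            render_bug_listing(large, bug_count=10000)
        # The well-known failure mode is one query per row; comparing
        # the two counts catches that immediately.
        self.assertEqual(small.count, large.count)


if __name__ == "__main__":
    unittest.main()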

> 
> maintenance+support squads together are paying 14/29=48% of the
> tech-debt listed as 'legacy', and doing that is taking 14/22=63% of
> their combined output. To stay on top of the legacy critical bug
> source then, we need a 100% increase in the legacy fix rate and that
> isn't available from the existing maintenance squads no matter whether
> we ask them to drop other sources of criticals or not. If we did not
> have maintenance added criticals (6 items) and that translated 1:1
> into legacy fixes we'd still be short 9 legacy bugfixes to keep the
> legacy component flat.
> 
> So this says to me, we are really mining things we didn't do well
> enough in the past, and it takes long enough to fix each one, that
> until we hit the bottom of the mine, it's going to be a standing
> feature for us.

Yes, I agree with that characterisation. But I would be hard-pressed to
change the ratio between feature and maintenance work. While addressing
tech-debt is important for the growth of the project, we also need to
make changes to ensure that the project stays relevant in the evolving
landscape.

> 
> I agree with the recommendations to spend some more effort on the
> safety nets of testing; the decreased use of doctests and increased
> use of unit tests should reduce maintenance overhead, and avoiding
> known problems is generally a positive thing. The SOA initiative will also
> help us decouple things as we go which should help with
> maintainability and reaction times.

Again, I agree. I'd really like TDD to be used as standard, but that's
very hard to "enforce" in a distributed environment.

> 
> What troubles me a bit is the unknown size of the legacy mine, and
> that from the analysis we added 25% of the legacy volume criticals
> from feature work. The great news is that all the ones you examined
> were fixed. I'd like us to make sure though, that we don't end up
> adding performance debt - which can be particularly hard to fix.
> 
> The numbers don't really say we're safe from this - 26% of criticals
> coming from changes (feature + maintenance) - is a large amount, and
> features in particular are not just tweaking things, they are making
> large changes, which adds up to a lot of risk.

Actually, you should probably add the thunderdome category (6%) to
this, since that was a kind of mini feature sprint in itself. That
means 33% of the new criticals are introduced as part of major new work.

> There are two aspects
> to the feature rotation that have been worrying me for a while; one is
> performance testing of new work (browser performance, data scaling -
> the works), the other is that we rotate off right after users get the
> feature. I think we should allow 10% of the feature time, or something
> like that, so that after-release-adoption issues can be fixed from the
> resources allocated to the feature. One way to do this would be to say
> that:
>  - After release feature squads spend 1-2 weeks doing polish and/or
> general bugs (in the area, or even just criticals->high etc). At the
> end of that time, they stay on the feature, doing this same stuff,
> until all the critical bugs introduced/uncovered by the feature work
> are fixed.

If I understand this correctly, you are saying that the maintenance
squad shouldn't start a new feature until the feature squad ready to
take their place has fixed all Criticals related to the feature (with a
minimum of 2 weeks to uncover issues)?

I think it's probably worth a try. It would be a relatively low-impact
way of tweaking the feature vs maintenance ratio.

> 
> For the performance side, we could make performance/scalability
> testing a release criterion: we already agree that all pages done
> during a feature must have a <1 sec 99th percentile and a 5 second
> timeout. Extending this to say that we've tested those pages with
> large datasets would be a modest tweak and likely catch issues.

That's something that Matthew and Diogo can add to the release checklist.

Are we enforcing the 5-second timeout in any way at this stage?
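
If we aren't, one low-tech option would be to fold both thresholds into
the large-dataset check you describe above. A rough sketch of what that
could look like; make_bugs() and render_page() are hypothetical
placeholders, and the thresholds are the ones you quoted:

# Sketch of a large-dataset timing check for the release checklist:
# render the page enough times to get a meaningful 99th percentile and
# compare it against the agreed thresholds.  make_bugs() and
# render_page() are hypothetical placeholders for real fixture and
# rendering helpers.
import time
import unittest


RENDER_BUDGET_99TH = 1.0   # seconds: agreed 99th percentile budget
HARD_TIMEOUT = 5.0         # seconds: agreed hard timeout


def make_bugs(count):
    """Hypothetical fixture: pretend to create `count` bugs."""
    return list(range(count))


def render_page(bugs):
    """Hypothetical renderer: pretend to render the bug listing."""
    return "<html>%d bugs</html>" % len(bugs)


def percentile(samples, fraction):
    """Return the value at the given fraction (0-1) of sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(fraction * (len(ordered) - 1))))
    return ordered[index]


class TestBugListingAtScale(unittest.TestCase):

    def test_render_time_with_large_dataset(self):
        bugs = make_bugs(10000)
        timings = []
        for _ in range(100):
            start = time.time()
            render_page(bugs)
            elapsed = time.time() - start
            # No single request may exceed the hard timeout.
            self.assertLess(elapsed, HARD_TIMEOUT)
            timings.append(elapsed)
        # The 99th percentile must stay within the agreed budget.
        self.assertLess(percentile(timings, 0.99), RENDER_BUDGET_99TH)


if __name__ == "__main__":
    unittest.main()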

> 
> I think it's OK that criticals found a few weeks later be handled by
> the maintenance squads, which will include the erstwhile feature squad
> that triggered them, but we should account for the majority of the
> feature-related criticals in the resourcing of the feature - scaling
> issues in particular can be curly and require weeks of work, something
> maintenance mode, with its interrupts etc, is not suited to. And our
> velocity measurements shouldn't be higher by not counting that work as
> part of the feature :)
> 

Agreed, and your 2-week+ wind-down period addresses that.

Cheers

-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx
