Re: Legacy, performance, testing: 6 months of new critical bugs analysed

Hi Robert,

Thanks for your analysis. Could you take some time to add your
recommendations to the document perhaps?

Some more comments below.

On 11-10-23 06:02 PM, Robert Collins wrote:
> Thanks for putting so much time into this Francis! I'm *very* glad we
> took the time to understand the issue rather than addressing the
> symptoms. Something I'd really like to see is a reduction in the
> overall number: I think there is an impact in having so many zomg bugs
> (and we do have that many bugs: ignore the labels, the issue is real).
> As I read the first pivot table, 1/3 of the filed bugs do not get
> fixed, and about 2/3rds of those are legacy bugs.
> 
> So what this says is '77% of the *increase* in criticals is due to
> long-standing existing defects': it's tech debt that we're *finally*
> paying off. If we were closing fewer of the legacy criticals, our
> increase would be substantially higher.
> 
> So while I'm *totally* behind a renewed focus on performance - totally
> totally totally - I wonder if perhaps performance and legacy bugs are
> similar in that they are both debt we're paying for now - schemas,
> code structure (lazy evaluation), incomplete implementations, etc.
> Performance bugs perhaps want more schema changes, but equally other
> correctness bugs need schema work too.

Yes, I agree with this analysis. I suggested focusing on performance for
two reasons:

1) insufficient_scaling is the individual category with the most bugs
falling under it (24%, next one is missing_integration_test at 14%)
2) they are very easy to identify

But I agree that any work spent on difficult areas (badly factored
code, spotty test coverage, etc.) is probably worthwhile, as it will
pre-emptively remove a bunch of bugs meeting our Critical criteria
before they ever have to be fixed.

It's just that performance problems are very easy to spot, and we have
several well-known patterns for addressing them.
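
For example, the pattern we apply most often is making sure the number
of queries issued by a page stays constant as the data grows. A rough
sketch of that kind of check follows; the recorder and rendering
helpers are hypothetical stand-ins, not our actual test infrastructure:

# Sketch of the "constant query count" pattern: the number of SQL
# statements issued to render a page should not grow with the size of
# the dataset.  The recorder and page-rendering helpers below are
# hypothetical stand-ins, not our real test infrastructure.
import unittest


class QueryRecorder:
    """Collects the SQL statements issued inside a `with` block."""

    def __init__(self):
        self.statements = []

    def record(self, statement):
        self.statements.append(statement)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    @property
    def count(self):
        return len(self.statements)


def render_bug_listing(recorder, bug_count):
    """Hypothetical page renderer: well-behaved, it issues a fixed
    number of queries regardless of how many bugs are listed."""
    recorder.record("SELECT * FROM Bug LIMIT %d" % bug_count)
    recorder.record("SELECT * FROM Person WHERE id IN (...)")  # eager load


class TestBugListingQueryCount(unittest.TestCase):

    def test_query_count_is_constant(self):
        # Render once with a small dataset...
        with QueryRecorder() as small:
            render_bug_listing(small, bug_count=10)
        # ...and again with a much larger one.
        with QueryRecorder() as large:
            render_bug_listing(large, bug_count=10000)
        # The well-known failure mode is one query per row; comparing
        # the two counts catches that immediately.
        self.assertEqual(small.count, large.count)


if __name__ == "__main__":
    unittest.main()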

> 
> maintenance+support squads together are paying 14/29=48% of the
> tech-debt listed as 'legacy', and doing that is taking 14/22=63% of
> their combined output. To stay on top of the legacy critical bug
> source then, we need a 100% increase in the legacy fix rate and that
> isn't available from the existing maintenance squads no matter whether
> we ask them to drop other sources of criticals or not. If we did not
> have maintenance added criticals (6 items) and that translated 1:1
> into legacy fixes we'd still be short 9 legacy bugfixes to keep the
> legacy component flat.
> 
> So this says to me, we are really mining things we didn't do well
> enough in the past, and it takes long enough to fix each one, that
> until we hit the bottom of the mine, it's going to be a standing
> feature for us.

Yes, I agree with that characterisation. But I would be hard-pressed to
change the ratio between feature and maintenance work. While addressing
tech-debt is important for the growth of the project, we also need to
make changes to ensure that the project stays relevant in the evolving
landscape.

> 
> I agree with the recommendations to spend some more effort on the
> safety nets of testing; the decreased use of doctests and increased
> use of unit tests should reduce maintenance overhead, and avoiding
> known problems is generally a positive thing. The SOA initiative will also
> help us decouple things as we go which should help with
> maintainability and reaction times.

Again, I agree. I'd really like TDD to be used as standard, but that's
very hard to "enforce" in a distributed environment.

> 
> What troubles me a bit is the unknown size of the legacy mine, and
> that from the analysis we added 25% of the legacy volume criticals
> from feature work. The great news is that all the ones you examined
> were fixed. I'd like us to make sure though, that we don't end up
> adding performance debt - which can be particularly hard to fix.
> 
> The numbers don't really say we're safe from this - 26% of criticals
> coming from changes (feature + maintenance) - is a large amount, and
> features in particular are not just tweaking things, they are making
> large changes, which adds up to a lot of risk.

Actually, you should probably add the thunderdome category (6%) to
this, since that was a kind of mini feature sprint in itself. That
means 33% of the new criticals are introduced as part of major new work.

> There are two aspects
> to the feature rotation that have been worrying me for a while; one is
> performance testing of new work (browser performance, data scaling -
> the works), the other is that we rotate off right after users get the
> feature. I think we should allow 10% of the feature time, or something
> like that, so that after-release-adoption issues can be fixed from the
> resources allocated to the feature. One way to do this would be to say
> that:
>  - After release feature squads spend 1-2 weeks doing polish and/or
> general bugs (in the area, or even just criticals->high etc). At the
> end of that time, they stay on the feature, doing this same stuff,
> until all the critical bugs introduced/uncovered by the feature work
> are fixed.

If I understand this correctly, you are saying that the maintenance
squad shouldn't start a new feature until the feature squad ready to
take their place has fixed all Criticals related to the feature (with a
minimum of 2 weeks to uncover issues)?

I think it's probably worth a try. It would be a relatively low-impact
way of tweaking the feature vs maintenance ratio.

> 
> For the performance side, we could make performance/scalability
> testing a release criterion: we already agree that all pages done
> during a feature must have a <1 sec 99th percentile and a 5 second
> timeout. Extending this to say that we've tested those pages with
> large datasets would be a modest tweak and likely catch issues.

That's something that Matthew and Diogo can add to the release checklist.

Are we enforcing the 5-second timeout in any way at this stage?
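
If we aren't, one low-tech option would be to fold both thresholds into
the large-dataset check you describe above. A rough sketch of what that
could look like; make_bugs() and render_page() are hypothetical
placeholders, and the thresholds are the ones you quoted:

# Sketch of a large-dataset timing check for the release checklist:
# render the page enough times to get a meaningful 99th percentile and
# compare it against the agreed thresholds.  make_bugs() and
# render_page() are hypothetical placeholders for real fixture and
# rendering helpers.
import time
import unittest


RENDER_BUDGET_99TH = 1.0   # seconds: agreed 99th percentile budget
HARD_TIMEOUT = 5.0         # seconds: agreed hard timeout


def make_bugs(count):
    """Hypothetical fixture: pretend to create `count` bugs."""
    return list(range(count))


def render_page(bugs):
    """Hypothetical renderer: pretend to render the bug listing."""
    return "<html>%d bugs</html>" % len(bugs)


def percentile(samples, fraction):
    """Return the value at the given fraction (0-1) of sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(fraction * (len(ordered) - 1))))
    return ordered[index]


class TestBugListingAtScale(unittest.TestCase):

    def test_render_time_with_large_dataset(self):
        bugs = make_bugs(10000)
        timings = []
        for _ in range(100):
            start = time.time()
            render_page(bugs)
            elapsed = time.time() - start
            # No single request may exceed the hard timeout.
            self.assertLess(elapsed, HARD_TIMEOUT)
            timings.append(elapsed)
        # The 99th percentile must stay within the agreed budget.
        self.assertLess(percentile(timings, 0.99), RENDER_BUDGET_99TH)


if __name__ == "__main__":
    unittest.main()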

> 
> I think it's OK that criticals found a few weeks later be handled by
> the maintenance squads, which will include the erstwhile feature squad
> that triggered them, but we should account for the majority of the
> feature-related criticals in the resourcing of the feature - scaling
> issues in particular can be curly and require weeks of work, something
> maintenance mode, with its interrupts etc, is not suited to. And our
> velocity measurements shouldn't be higher by not counting that work as
> part of the feature :)
> 

Agreed, and your 2-week+ wind-down period addresses that.

Cheers

-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx
