launchpad-dev team mailing list archive

Re: Legacy, performance, testing: 6 months of new critical bugs analysed

On Tue, Oct 25, 2011 at 7:54 AM, Francis J. Lacoste
<francis.lacoste@xxxxxxxxxxxxx> wrote:
> Hi Robert,
>
> Thanks for your analysis. Could you take some time to add your
> recommendations to the document perhaps?

Sure, I wanted a little discussion first :)
>> So while I'm *totally* behind a renewed focus on performance - totally
>> totally totally - I wonder if perhaps performance and legacy bugs are
>> similar in that they are both debt we're paying for now - schemas,
>> code structure (lazy evaluation), incomplete implementations etc.
>> Performance bugs perhaps want more schema changes, but equally other
>> correctness bugs need schema work too.
>
> Yes, I agree with this analysis. I suggested focusing on performance for
> two reasons:
>
> 1) insufficient_scaling is the individual category with the most bugs
> falling under it (24%, next one is missing_integration_test at 14%)
> 2) they are very easy to identify
>
> But I agree that any work spent on difficult areas (badly factored
> code, spotty test coverage, etc.) is probably worthwhile, as it will
> pre-emptively remove a bunch of bugs that would otherwise meet our
> Critical criteria and need fixing.
>
> It's just that performance problems are very easy to spot, and we have
> several well-known patterns for addressing them.

Fair enough. I hope it won't add much friction for folks choosing what
to work on next :) (not that I can argue against it much, given my
obsession with performance :)).

>> So this says to me, we are really mining things we didn't do well
>> enough in the past, and it takes long enough to fix each one, that
>> until we hit the bottom of the mine, it's going to be a standing
>> feature for us.
>
> Yes, I agree with that characterisation. But I would be hard-pressed to
> change the ratio between feature and maintenance work. While addressing
> tech-debt is important for the growth of the project, we also need to
> make changes to make sure the project stays relevant in the evolving
> landscape.

Yup, totally. However, this does give us actual data on the size of
team needed to stay afloat, something we didn't have before.

>>
>> I agree with the recommendations to spend some more effort on the
>> safety nets of testing; the decreased use of doctests and increased
>> use of unit tests should reduce maintenance overhead, and avoiding
>> known problems is generally a positive thing. The SOA initiative will
>> also help us decouple things as we go, which should help with
>> maintainability and reaction times.
>
> Again, I agree. I'd really like TDD to be used as standard, but that's
> very hard to "enforce" in a distributed environment.

We can do more pair programming at epics and sprints. That generally
drives more testing because the extra person sees more ways it can
fail.
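
To make the doctests-versus-unit-tests point above concrete, this is
the shape of conversion we mean (an invented example with a stand-in
class, not real Launchpad code):

    import unittest

    class Branch:
        # Stand-in for a real model class, purely for illustration.
        def __init__(self, owner):
            self.owner = owner

    # A doctest would assert this in narrative form:
    #     >>> branch = Branch(owner='alice')
    #     >>> branch.owner
    #     'alice'
    # The same check as a focused, independently runnable unit test:
    class TestBranchOwner(unittest.TestCase):
        def test_branch_records_its_owner(self):
            branch = Branch(owner='alice')
            self.assertEqual('alice', branch.owner)

    if __name__ == '__main__':
        unittest.main()

Small isolated tests like that fail one at a time, rather than taking
a whole narrative document down with them, which is where the
maintenance saving comes from.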

>> The numbers don't really say we're safe from this: 26% of criticals
>> coming from changes (feature + maintenance) is a large amount, and
>> features in particular are not just tweaking things, they are making
>> large changes, which adds up to a lot of risk.
>
> Actually, you should probably add up the thunderdome category here too
> (6%), since that was a kind of mini feature sprint in itself. That
> brings it to 33% of the new criticals being introduced as part of
> major new work.

Sure. I kind of figured we should expect some fallout from such events
:). Given that we don't have any follow-on mechanism for supporting
changes started there, it doesn't really speak to overall team
structure. But perhaps it suggests we shouldn't use getting together
as a way to land lots of overdue/exciting work: the fallout is quite
noticeable.

>> There are two aspects
>> to the feature rotation that have been worrying me for a while; one is
>> performance testing of new work (browser performance, data scaling -
>> the works), the other is that we rotate off right after users get the
>> feature. I think we should allow 10% of the feature time, or something
>> like that, so that after-release-adoption issues can be fixed from the
>> resources allocated to the feature. One way to do this would be to say
>> that:
>>  - After release, feature squads spend 1-2 weeks doing polish and/or
>> general bugs (in the area, or even just criticals->high etc.). At the
>> end of that time, they stay on the feature, doing this same stuff,
>> until all the critical bugs introduced/uncovered by the feature work
>> are fixed.
>
> If I understand this correctly, you are saying that the maintenance
> squad shouldn't start a new feature until the feature squad ready to
> take their place has fixed all Criticals related to the feature (with
> a minimum of 2 weeks to uncover issues)?

Yes, that's right. And the feature squad should still be prioritising
work related to their feature during that period.

> I think it's probably worth a try. It would be a relatively low-impact
> way of tweaking the feature vs maintenance ratio.

Cool!

>>
>> For the performance side, we could make performance/scalability
>> testing a release criterion: we already agree that all pages done
>> during a feature must have a <1 sec 99th percentile and a 5 second
>> timeout. Extending this to say that we've tested those pages with
>> large datasets would be a modest tweak and would likely catch issues.
>
> That's something that Matthew and Diogo can add to the release checklist.
>
> Are we enforcing the 5-second timeout in any way at this stage?

As soon as we have FDT wrapped up (in particular the slony upgrade)
we'll look at some automated reporting around new page ids.
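
Roughly the shape of the thing (a sketch only; the record-loading and
other names here are made up, not our actual tooling):

    from collections import defaultdict

    PERCENTILE_LIMIT = 1.0   # agreed 99th percentile budget, in seconds
    HARD_TIMEOUT = 5.0       # agreed hard timeout, in seconds

    def percentile(values, fraction):
        # Value at the given fraction (0-1) of the sorted samples.
        ordered = sorted(values)
        index = min(int(len(ordered) * fraction), len(ordered) - 1)
        return ordered[index]

    def report_violations(records):
        # records: iterable of (page_id, render_seconds) pairs,
        # e.g. parsed out of the request logs.
        times = defaultdict(list)
        for page_id, seconds in records:
            times[page_id].append(seconds)
        for page_id, samples in sorted(times.items()):
            p99 = percentile(samples, 0.99)
            worst = max(samples)
            if p99 >= PERCENTILE_LIMIT or worst >= HARD_TIMEOUT:
                print('%s: p99 %.2fs, worst %.2fs (%d samples)'
                      % (page_id, p99, worst, len(samples)))

Run over a day's logs, that would give us a per-page-id list of
anything breaching the 1s/5s limits, which is most of what the
automated reporting needs to do.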

Cheers,
Rob
