
ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch


On 5 September 2013 18:35, Steve Langasek <steve.langasek@xxxxxxxxxx> wrote:
>> Is this a proposal for 13.10?
> I think it's unrealistic to think anything discussed here would land for
> 13.10.  We already have plenty of other things on our plate that are on the
> critical path for 13.10. :)

Agreed :)

> Anyway, point taken that we shouldn't deploy something that could cause
> processes that were previously perfectly reliable to suddenly be killed by
> some other process which arbitrarily decides they're "misbehaving", thus
> sending the whole system into turmoil.  If we're going to go around killing
> system processes, we should be sure that the cure isn't worse than the
> disease.

We should also be careful not to spin out of control ourselves by
playing whack-a-mole with an out-of-control process. Presumably this
is handled by upstart's respawn stanza?
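
To make the whack-a-mole concern concrete, the guard amounts to
rate-limiting restarts. Below is a minimal Python sketch of that idea,
assuming a limit of the same shape as upstart's "respawn limit COUNT
INTERVAL". The class name, method name and default values are
hypothetical, and this is not upstart's implementation.

```python
import collections


class RespawnLimiter:
    """Allow at most `count` restarts within any `interval`-second window.

    Hypothetical helper for illustration only; the defaults are
    assumptions, not values taken from any real configuration.
    """

    def __init__(self, count=10, interval=5.0):
        self.count = count
        self.interval = interval
        self._stamps = collections.deque()  # times of recent restarts

    def allow(self, now):
        """Return True if a restart at time `now` (seconds) is permitted."""
        # Forget restarts that have aged out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self.interval:
            self._stamps.popleft()
        if len(self._stamps) >= self.count:
            # Too many restarts too quickly: give up rather than loop.
            return False
        self._stamps.append(now)
        return True
```

A monitor would call allow() before each kill-and-restart and leave
the process alone (and report it) once the limiter says no.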

> Certainly; any salient examples are going to be bugs we already know about,
> and thus which are likely to be fixed or in progress.
> The question I have is: would a monitor/killer for runaway processes have
> improved our response to these bugs?  Would it have resulted in earlier
> detection?  Easier diagnosis?  Faster fixing?  Would such monitoring tell us
> about other such bugs that we are currently unaware of and need to be?

I could not agree more with the data-driven approach here. I think
you're absolutely spot on to suggest this needs to prove its value
with some concrete numbers.

> I'm not convinced that the answer is "yes" to any of these.  Obviously, the
> only way to know if it would tell us about bugs we're unaware of is to try
> it and see :), but I think the fact that we are currently unaware of them is
> already a strong indicator that they should not be a high priority, because
> if they were high-impact they would organically rise to our attention.

So knowing what problems are out there is half of what something like
this gives us. https://errors.ubuntu.com has discovered lots of
serious problems not caught by our pre-release QA.

The other half is knowing how critical each problem is. Some subset of
the problems out there may rise to our attention, but we won't know how
important they are because we won't have a clear picture of how many
systems they affect. Engineering resource is finite. We have to make
tough decisions about which fixes are going to get the most bang for
the buck.
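
As a rough sketch of what "how many systems they affect" could look
like in practice, the Python below ranks problems by the number of
distinct systems reporting them. The report format (pairs of system
ID and problem name) and the function name are assumptions made for
illustration, not a description of how errors.ubuntu.com works.

```python
from collections import defaultdict


def rank_by_affected_systems(reports):
    """Rank problems by how many distinct systems reported them.

    `reports` is an iterable of (system_id, problem) pairs; this shape
    is hypothetical. Repeat reports from one system count only once.
    """
    systems = defaultdict(set)
    for system_id, problem in reports:
        systems[problem].add(system_id)
    # Most widely seen problems first; ties broken by problem name.
    return sorted(
        ((problem, len(ids)) for problem, ids in systems.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Counting distinct systems rather than raw reports keeps one badly
affected device from making a problem look widespread.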
