ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

 

On Thu, Sep 5, 2013 at 8:04 PM, Evan Dandrea <ev@xxxxxxxxxx> wrote:
> On 5 September 2013 18:35, Steve Langasek <steve.langasek@xxxxxxxxxx> wrote:
>>> Is this a proposal for 13.10?
>>
>> I think it's unrealistic to think anything discussed here would land for
>> 13.10.  We already have plenty of other things on our plate that are on the
>> critical path for 13.10. :)
>
> Agreed :)
>
>> Anyway, point taken that we shouldn't deploy something that could cause
>> processes that were previously perfectly reliable to suddenly be killed by
>> some other process which arbitrarily decides they're "misbehaving", thus
>> sending the whole system into turmoil.  If we're going to go around killing
>> system processes, we should be sure that the cure isn't worse than the
>> disease.
>

It's still an interesting idea, much like the chaos monkeys commonly
used in cloud infrastructure: essentially a predator that terminates
processes, either at random or according to some criteria, and causes
havoc.
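
To make "according to some criteria" concrete, here is a minimal sketch
of one possible criterion: only flag a process after several consecutive
high CPU samples, so a short burst is not mistaken for a run-away. The
class name, the 90% threshold and the three-sample window are all
hypothetical, and the sketch deliberately only flags; how samples are
collected and what to do with a flagged process are left open.

```python
from collections import deque


class RunawayDetector:
    """Flag a pid only after `window` consecutive samples at or above `threshold`.

    Hypothetical helper for illustration; it never kills anything itself.
    """

    def __init__(self, threshold=90.0, window=3):
        self.threshold = threshold  # CPU percentage treated as "too busy"
        self.window = window        # consecutive busy samples required
        self._history = {}          # pid -> deque of the most recent samples

    def observe(self, pid, cpu_percent):
        """Record one CPU sample; return True if the pid now looks like a run-away."""
        samples = self._history.setdefault(pid, deque(maxlen=self.window))
        samples.append(cpu_percent)
        return len(samples) == self.window and all(
            s >= self.threshold for s in samples
        )

    def forget(self, pid):
        """Drop the history for a pid that has exited."""
        self._history.pop(pid, None)
```

A caller would feed it one sample per process per polling interval and
report (rather than kill) anything for which observe() returns True,
which would also give us the numbers asked for further down the thread.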

> We should also be careful not to spin out of control ourselves, trying
> to play whack-a-mole with an out-of-control process. Presumably this
> is handled by upstart's respawn stanza?
>
>> Certainly; any salient examples are going to be bugs we already know about,
>> and thus which are likely to be fixed or in progress.
>>
>> The question I have is: would a monitor/killer for runaway processes have
>> improved our response to these bugs?  Would it have resulted in earlier
>> detection?  Easier diagnosis?  Faster fixing?  Would such monitoring tell us
>> about other such bugs that we are currently unaware of and need to be?
>
> I could not agree more with the data-driven approach here. I think
> you're absolutely spot on to suggest this needs to prove its value
> with some concrete numbers.
>
>> I'm not convinced that the answer is "yes" to any of these.  Obviously, the
>> only way to know if it would tell us about bugs we're unaware of is to try
>> it and see :), but I think the fact that we are currently unaware of them is
>> already a strong indicator that they should not be a high priority, because
>> if they were high-impact they would organically rise to our attention.
>
> So knowing what problems are out there is half of what something like
> this gives us. https://errors.ubuntu.com has discovered lots of
> serious problems not caught by our pre-release QA.
>
> The other half is knowing how critical each problem is. Some subset of
> the problems out there may rise to our attention, but we won't know how
> important they are because we won't have a clear picture of how many
> systems they affect. Engineering resource is finite. We have to make
> tough decisions about which fixes are going to get the most bang
> for the buck.
>

+1.

Cheers,

  Thomas
