ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

To: Tony Espy <espy@xxxxxxxxxxxxx>
From: Steve Langasek <steve.langasek@xxxxxxxxxx>
Date: Thu, 5 Sep 2013 10:35:03 -0700
Cc: Oliver Grawert <ogra@xxxxxxxxxxxxx>, ubuntu-phone@xxxxxxxxxxxxxxxxxxx, Martin Pitt <martin.pitt@xxxxxxxxxx>, Dmitrijs Ledkovs <dmitrijs.ledkovs@xxxxxxxxxxxxx>, Evan Dandrea <ev@xxxxxxxxxx>, Adam Conrad <adconrad@xxxxxxxxxxxxx>, Brian Murray <brian@xxxxxxxxxxxxx>
In-reply-to: <52275755.2010404@canonical.com>
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Sep 04, 2013 at 11:52:53AM -0400, Tony Espy wrote:
> On 09/04/2013 05:49 AM, Evan Dandrea wrote:
> > In another discussion, James Hunt raised the possibility of
> > periodically checking for runaway processes on Touch, killing those
> > consuming 100% CPU while creating a report to be sent to
> > https://errors.ubuntu.com.

> > I've summarised the key points of that discussion here into a
> > proposal. The hope of this is that it gives everyone a chance to
> > provide input.

> Is this a proposal for 13.10?

I think it's unrealistic to expect anything discussed here to land for
13.10.  We already have plenty of other things on our plate that are on the
critical path for 13.10. :)

> I understand the basic reasoning, but wantonly killing system processes
> and then hoping that the system will always gracefully recover sounds a
> bit risky to me.

> Many system services have complex start-up sequences, and although they
> *should* handle restarts properly, they may not, potentially leaving the
> device in an inconsistent state.  Processes handled by upstart should
> work, but what about helper processes (e.g., dhclient)?  Are all of them
> guaranteed to be automatically restarted?

If the system does not behave sanely when one of its processes dies
unexpectedly, that's a serious bug that we need to fix.  Whether or not this
particular runaway-killer is implemented, processes may die at any time due
to bugs (e.g., SIGSEGV being raised), and the system needs to be robust in
the face of such problems.

I can't say that there is currently any *guarantee* that this is how all
components of the system operate, but it is certainly the case that Ubuntu
has been designed with this kind of graceful failure in mind.  If we're not
confident that this is how the system actually behaves, maybe we should be
testing that.

Anyway, point taken that we shouldn't deploy something that could cause
processes that were previously perfectly reliable to suddenly be killed by
some other process which arbitrarily decides they're "misbehaving", thus
sending the whole system into turmoil.  If we're going to go around killing
system processes, we should be sure that the cure isn't worse than the
disease.
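
For concreteness, here is a minimal detection-only sketch in Python of the
kind of check under discussion, assuming a Linux /proc filesystem.  The
function names, the 10-second sampling interval and the 0.95 threshold are
illustrative assumptions, not part of the proposal.  It only reports
candidates and kills nothing, which sidesteps the risk described above.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is parenthesised and may itself contain spaces or
    # parentheses, so split only what follows the last ')'.
    rest = stat_line[stat_line.rfind(")") + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_fraction(ticks_before, ticks_after, interval_s, ticks_per_s):
    """Fraction of one CPU used between two samples taken interval_s apart."""
    return (ticks_after - ticks_before) / (interval_s * ticks_per_s)


def sample_all():
    """Map pid -> accumulated CPU ticks for every process readable right now."""
    samples = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % name) as f:
                samples[int(name)] = cpu_ticks(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the line was unparsable
    return samples


def find_runaways(interval_s=10.0, threshold=0.95):
    """Return [(pid, fraction)] for processes at or above the threshold."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    before = sample_all()
    time.sleep(interval_s)
    after = sample_all()
    result = []
    for pid, ticks in after.items():
        if pid not in before:
            continue  # process started after the first sample
        fraction = cpu_fraction(before[pid], ticks, interval_s, ticks_per_s)
        if fraction >= threshold:
            result.append((pid, fraction))
    return result
```

A single sample window like this cannot tell a runaway from a process doing
legitimate heavy work, so a real monitor would presumably need several
consecutive windows and some per-process policy before taking any action.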

> > == Examples ==

> > There are a few examples of this problem biting us already.

> Two of the three bugs listed below "had" bitten us, but have been fixed.

Certainly; any salient examples are going to be bugs we already know about,
and thus which are likely to be fixed or in progress.

The question I have is: would a monitor/killer for runaway processes have
improved our response to these bugs?  Would it have resulted in earlier
detection?  Easier diagnosis?  Faster fixing?  Would such monitoring tell us
about other such bugs that we are currently unaware of but need to know about?

I'm not convinced that the answer is "yes" to any of these.  Obviously, the
only way to know if it would tell us about bugs we're unaware of is to try
it and see :), but I think the fact that we are currently unaware of them is
already a strong indicator that they should not be a high priority, because
if they were high-impact they would organically rise to our attention.

Cheers,
-- 
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
slangasek@xxxxxxxxxx                                     vorlon@xxxxxxxxxx
