
ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

 

On Wed, 2013-09-04 at 14:35 +0100, Evan Dandrea wrote:

> On 4 September 2013 12:25, Thomas Voß <thomas.voss@xxxxxxxxxxxxx> wrote:
> > +1, the respective grace/timeout period would need to be determined
> > from empirical data, too.
> 
> Agreed.



I think that it should also be per-service.  For instance, HUD does
voice recognition, which can legitimately use the CPU for longer
stretches.  But we could really tighten down URL dispatcher; it
shouldn't use much CPU at all.
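Purely to make that concrete, the limits could be as simple as a
per-service table somewhere (the services and numbers below are made
up, just to show the shape of it):

        # Hypothetical per-service limits: (max CPU percent, grace seconds).
        # Values are illustrative only, not a proposal.
        CPU_LIMITS = {
            "hud":            (80, 60),   # voice recognition can legitimately spin
            "url-dispatcher": (10, 5),    # should be nearly idle
        }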


> >> With the application lifecycle work, background applications will be
> >> suspended and get no CPU time at all, so this check will only apply to
> >> system processes.
> >>
> >
> > True, but to determine the CPU percentage, we would need to have the
> > CPU usage of all processes available for a certain amount of time.
> > That is, essentially parsing all of /proc, if I understand correctly.
> > I'm hopefully wrong, but if not, could we resort to an approach that
> > just considers the per-process user and system CPU time consumed in a
> > given time interval?
> 
> Yes, I didn't mean to imply that we filter them out from any
> calculation, but rather that we do not consider reporting them because
> they'll either be foregrounded or suspended.
> 
> > Yup, but we should start out with measuring before we start
> > classifying CPU usage.
> 
> Yes, definitely. It absolutely makes sense to run this in a
> measure-only mode before we flip on the out-of-control killer.



We should probably start with just reporting bugs, and only as a next
step start killing.  "Would have been killed" bugs might be an
interesting metric :-)
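On the measurement question above: I don't think we need to walk all
of /proc either.  Per-process user+system time is right there in
/proc/<pid>/stat, so a measure-only pass could be roughly this (a rough
sketch only, not a worked-out design; field layout per proc(5)):

        # Sample one process's CPU time twice and turn the delta into a
        # percentage over the interval.  Sketch only.
        import os
        import time

        CLOCK_TICKS = os.sysconf("SC_CLK_TCK")

        def cpu_seconds(pid):
            with open("/proc/%d/stat" % pid) as f:
                data = f.read()
            # comm (field 2) may contain spaces, so split after the ')'.
            fields = data[data.rindex(")") + 2:].split()
            utime, stime = int(fields[11]), int(fields[12])
            return (utime + stime) / float(CLOCK_TICKS)

        def cpu_percent(pid, interval=5.0):
            before = cpu_seconds(pid)
            time.sleep(interval)
            return 100.0 * (cpu_seconds(pid) - before) / interval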


> > I would rather keep it out of whoopsie and integrate it with the
> > component implementing the lifecycle policy, for two reasons.
> > That said, if we are only considering monitoring system/session
> > services, it would be OK to start out with whoopsie. However, I would
> > prefer an implementation that is easily reusable by other components
> > (Mir/Unity8).
> 
> Yes, and I believe this was suggested in the prior discussion by
> Steve. I forgot to include it in the summary; sorry.
> 
> I am definitely behind this living in the lifecycle policy system.
> While whoopsie is long-running and does operate in the area of error
> reporting, it does not already have code to poke at the brains of
> processes. That's all handled by apport via the kernel core pipe. So
> it felt like scope creep to me.
> 
> > Indeed. I think we would need to consider visibility (in terms of UI)
> > and ANR (application not responding) in the policy, and the component
> > implementing the lifecycle policy, i.e., Mir and Unity8, is a better
> > place to implement the behavior.
> 
> Definitely, though care would still need to be taken for things like
> GCC on the desktop.



It seems to me that for all of these long-running services the
"manager" is Upstart.  It restarts them if they crash or do other
stupid things, and it knows whether they're running.  This seems
roughly like respawn limits[1], which are per-task and can be
configured to produce different results.

Also, it seems that this should work within those limits: we should try
restarting the service to see if that solves the problem, but keep it
on a shorter leash the second time around.

To give people something more specific to attack, I'll say this: we
should add a line to Upstart job configs that looks like this:


        cpu limit [CPU Percentage] [seconds]
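
So a job like url-dispatcher might end up looking something like this
(entirely illustrative; neither the stanza nor the numbers exist
today):

        # /etc/init/url-dispatcher.conf (illustrative only)
        description "Example job carrying the proposed stanza"
        respawn
        respawn limit 10 5

        # proposed: act if the job sits above 10% CPU for 30 seconds
        cpu limit 10 30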


Then we can have a small upstart-bridge-like process that watches
Upstart for started, stopped, and added jobs, to ensure that they're on
the naughty/nice list and that they behave within those limits.
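
A first cut of that watcher could be pretty dumb: poll the jobs we care
about and count how long each one stays over its limit (sketch only;
get_pid() and report_or_kill() are stand-ins for the real Upstart and
whoopsie/apport plumbing, and cpu_percent() is the sampling helper
sketched earlier):

        # Naive enforcement loop over a per-service limits table like the
        # one above.  Sketch only.
        import time

        def watch(limits, poll=5.0):
            over_since = {}
            while True:
                for job, (max_percent, max_seconds) in limits.items():
                    pid = get_pid(job)          # stand-in: ask Upstart for the pid
                    if pid is None:
                        over_since.pop(job, None)
                        continue
                    if cpu_percent(pid, interval=poll) > max_percent:
                        started = over_since.setdefault(job, time.time())
                        if time.time() - started >= max_seconds:
                            # first pass: file a "would have been killed" bug;
                            # later: actually restart on a shorter leash.
                            report_or_kill(job, pid)
                            over_since.pop(job, None)
                    else:
                        over_since.pop(job, None)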

Ted

PS - Same for RAM?

[1] http://upstart.ubuntu.com/cookbook/#respawn-limit
