ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

To: Evan Dandrea <ev@xxxxxxxxxx>
From: Thomas Voß <thomas.voss@xxxxxxxxxxxxx>
Date: Wed, 4 Sep 2013 13:25:09 +0200
Cc: Oliver Grawert <ogra@xxxxxxxxxxxxx>, Ubuntu Touch <ubuntu-phone@xxxxxxxxxxxxxxxxxxx>, Martin Pitt <martin.pitt@xxxxxxxxxx>, Dmitrijs Ledkovs <dmitrijs.ledkovs@xxxxxxxxxxxxx>, Steve Langasek <steve.langasek@xxxxxxxxxxxxx>, Adam Conrad <adconrad@xxxxxxxxxxxxx>, Brian Murray <brian@xxxxxxxxxxxxx>
In-reply-to: <CAOe9oG7b7ts2dyyYtgMt=3=5=a=Vw9vEiVk5oOWWXqNmZE=YVg@mail.gmail.com>

Hey Evan,

thanks for the summary and for bringing up the topic. A few comments inline:

On Wed, Sep 4, 2013 at 11:49 AM, Evan Dandrea <ev@xxxxxxxxxx> wrote:
> Hi folks,
>
> In another discussion, James Hunt raised the possibility of
> periodically checking for runaway processes on Touch, killing those
> consuming 100% CPU while creating a report to be sent to
> https://errors.ubuntu.com.
>
> I've summarised the key points of that discussion here into a
> proposal. The hope of this is that it gives everyone a chance to
> provide input.
>
> == Examples ==
>
> There are a few examples of this problem biting us already.
>
> The original bug James ran into was:
> https://bugs.launchpad.net/ubuntu/+source/bluetooth-touch/+bug/1217865
>
> Martin Pitt also raised one where two rogue system service processes
> constantly used 150% CPU (i.e. 1.5 cores):
> https://launchpad.net/bugs/1188404
>
> A few weeks ago there was a nasty timing bug which caused ueventd to
> use 100% CPU:
> https://bugs.launchpad.net/touch-preview-images/+bug/1190792
>
> Whoopsie also had a memory corruption bug which caused 100% CPU usage
> around the same time as the ueventd bug:
> https://bugs.launchpad.net/touch-preview-images/+bug/1211417
>
> Note that this is not really about power consumption. Colin King has
> done analysis of power consumption on Touch devices and the biggest
> bang for the buck is ensuring that sensors are turned off when they
> are not needed, not minimising CPU usage. Instead, please consider
> this proposal an attempt to better ensure the stability and
> performance of Touch systems out in the wild.
>
> == Implementation ==
>
> We will enable the sampling and reporting of high CPU usage in
> background processes on Touch devices when the device is not in
> developer mode.
>
> Foreground processes will be ignored by this check. They will instead
> be handled by an "application not responding" (ANR) implementation in
> Mir. They will be allowed to use 100% CPU unless they block the UI
> thread for an unreasonable amount of time.
>

+1, the respective grace/timeout period would need to be determined
from empirical data, too.

> With the application lifecycle work, background applications will be
> suspended and get no CPU time at all, so this check will only apply to
> system processes.
>

True, but to determine the CPU percentage, we would need to have the
CPU usage of all processes available for a certain amount of time.
That essentially means parsing all of /proc, if I understand
correctly. I'm hopefully wrong, but if not, could we resort to an
approach that just considers the per-process user and system CPU time
consumed in a given time interval?
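
For illustration, here is a minimal sketch of that per-process
approach. It assumes a Linux /proc filesystem and reads only the one
stat file of the process in question; the function names are
hypothetical, and Python is used purely for brevity (the real check
would more likely be written in C):

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the last ')'.
    fields = stat_line[stat_line.rindex(")") + 2:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def read_cpu_ticks(pid):
    """Read the cumulative CPU ticks consumed by one process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_ticks(f.read())


def cpu_percent(pid, interval=1.0):
    """Percentage of one core used by pid over roughly `interval` seconds."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    before = read_cpu_ticks(pid)
    start = time.monotonic()
    time.sleep(interval)
    elapsed = time.monotonic() - start
    after = read_cpu_ticks(pid)
    return 100.0 * (after - before) / ticks_per_second / elapsed
```

A value above 100 would mean the process is keeping more than one
core busy, as in the 150% example quoted above.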

> Each background process will be periodically sampled for its CPU
> usage. If the process is using a large amount of CPU consistently
> across several of these samplings, it will be killed and an apport
> report will be created.
>
> An outstanding question is what the threshold should be for high CPU usage.
>

Yup, but we should start by measuring before we begin classifying
CPU usage.
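
To make the "consistently across several samplings" rule concrete,
here is a small sketch. The class name is hypothetical, and both the
90% threshold and the count of three samples are placeholders, since
the actual values are exactly what still has to be measured:

```python
class RunawayDetector:
    """Flags a process once it reaches `threshold` percent CPU in
    `required` consecutive samples. Both defaults are placeholders."""

    def __init__(self, threshold=90.0, required=3):
        self.threshold = threshold
        self.required = required
        self._streak = {}  # pid -> number of consecutive high samples

    def observe(self, pid, percent):
        """Record one sample; return True if pid now counts as a runaway."""
        if percent >= self.threshold:
            self._streak[pid] = self._streak.get(pid, 0) + 1
        else:
            # A single quiet sample resets the streak.
            self._streak.pop(pid, None)
        return self._streak.get(pid, 0) >= self.required
```

Resetting on a single low sample keeps short bursts of legitimate
work from being flagged; whether that is the right policy is also
something the measurements would have to show.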

> == Where will this check live? ==
>
> It has been suggested that the task of periodically checking for
> runaway processes live inside a long-running and lightweight C
> process. Whoopsie was suggested as a potential candidate.
>

I would rather keep it out of whoopsie and integrate it with the
component implementing the lifecycle policy for two reasons. That
said, if we are only considering monitoring system/session services,
it would be OK to start with whoopsie. Even then, I would prefer an
implementation that is easily reusable by other components
(Mir/Unity8).

> libprocps was raised as potentially helpful, but James pointed out
> that CPU percentage needed to be calculated by the caller, so another
> approach may prove easier.
>

I briefly scanned through the code, and I think we should rather come
up with our own API that we then implement with the help of libprocps
(or libgtop, for example).
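
As a rough sketch of what such an API could look like: the names
CpuTimeSource and CpuMonitor are hypothetical, and no libprocps or
libgtop calls are shown. A backend built on either library would only
have to provide the one method below (Python is used here for
brevity):

```python
from abc import ABC, abstractmethod


class CpuTimeSource(ABC):
    """Hypothetical backend interface: anything that can report the
    cumulative CPU time a process has consumed."""

    @abstractmethod
    def cpu_seconds(self, pid):
        """Total user + system CPU time used by pid so far, in seconds."""


class CpuMonitor:
    """Backend-independent front end that callers would use."""

    def __init__(self, source):
        self._source = source
        self._last = {}  # pid -> (timestamp, cumulative cpu seconds)

    def sample(self, pid, now):
        """Return the percentage of one core used since the previous
        sample, or None on the first sample for this pid."""
        cpu = self._source.cpu_seconds(pid)
        previous = self._last.get(pid)
        self._last[pid] = (now, cpu)
        if previous is None or now <= previous[0]:
            return None
        return 100.0 * (cpu - previous[1]) / (now - previous[0])
```

Keeping the percentage calculation in the front end means the caller
does not have to do it, which was the concern raised about using
libprocps directly.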

> == How will we group reports of the same underlying problem? ==
>
> https://errors.ubuntu.com will need to receive a string that
> represents the problem (a signature) to which this instance of a
> runaway process belongs. This lets the website group together the
> instances of a problem onto a single page and increment the count for
> the problem on the front page leaderboard.
>
> Whoopsie, or whatever process holds this check, will use the ptrace
> system call to generate a stack trace of the runaway process which
> apport can then use to generate a crash signature:
> http://bazaar.launchpad.net/~apport-hackers/apport/trunk/view/head:/apport/report.py#L1199
>
> Martin suggested we could do three stack traces each 1 s ± <random
> interval> apart, and then chop away the differing part at the top, so
> that we only keep the common bit.
>
> Since we would be generating multiple stack traces, we cannot just
> build the report and stack trace through the traditional means of
> triggering apport's kernel core pipe handler by sending SIGABRT.
>
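
Martin's suggestion of keeping only the common part of several stack
traces can be sketched as follows. This assumes each trace is a list
of function names with the innermost frame first; the function name
and the example frames are made up for illustration, and this is not
how apport computes signatures today:

```python
def common_stack_bottom(traces):
    """Return the frames shared by all traces, counted from the
    outermost frame. Each trace lists the innermost frame first."""
    if not traces:
        return []
    common = []
    # Walk all traces from the outermost frame inwards, in lock-step.
    for frames in zip(*(reversed(t) for t in traces)):
        if any(f != frames[0] for f in frames):
            break
        common.append(frames[0])
    common.reverse()  # back to innermost-first order
    return common


traces = [
    ["read", "poll_device", "main_loop", "main"],
    ["usleep", "retry", "poll_device", "main_loop", "main"],
    ["poll_device", "main_loop", "main"],
]
print(common_stack_bottom(traces))
# prints ['poll_device', 'main_loop', 'main']
```

The differing frames at the top (read, usleep, retry) are dropped, and
the remaining frames could then be fed into signature generation.
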
> == Bringing the check to the Ubuntu desktop ==
>
> It was suggested that we could also have this check on the Ubuntu
> desktop, but it was quickly pointed out that great care would need to
> be taken to prevent reporting when gcc or Firefox uses 100% CPU.
>

Indeed. I think we would need to consider visibility (in terms of UI)
and ANR in the policy, and the component implementing the lifecycle
policy, i.e., Mir and Unity8, is a better place to implement the
behavior.

Cheers,

  Thomas

> This would be particularly annoying since the desktop currently
> presents a dialog whenever an error occurs. There are plans underway
> to group errors that do not need your immediate attention
> (non-application crashes, e.g. package installation failures) into a
> single dialog with the next error that does require your attention
> (Firefox crashing); however, a quicker solution would be to only
> report these desktop runaway processes on systems that have automatic
> error reporting enabled.
>
> We could then create a blacklist of processes that are known to be
> intensive but safe using the data gathered from Touch and automatic
> reporting systems and eventually bring reporting of runaway processes
> to all Ubuntu systems (save servers).
>
> A whitelist was considered, but it was determined that it would not
> save us from problems like the ueventd bug.
>
> Thanks,
> Evan