ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

To: Evan Dandrea <ev@xxxxxxxxxx>
From: John Lea <john.lea@xxxxxxxxxxxxx>
Date: Wed, 04 Sep 2013 16:06:33 +0100
Cc: Oliver Grawert <ogra@xxxxxxxxxxxxx>, ubuntu-phone@xxxxxxxxxxxxxxxxxxx, Martin Pitt <martin.pitt@xxxxxxxxxx>, Dmitrijs Ledkovs <dmitrijs.ledkovs@xxxxxxxxxxxxx>, Steve Langasek <steve.langasek@xxxxxxxxxxxxx>, Adam Conrad <adconrad@xxxxxxxxxxxxx>, Brian Murray <brian@xxxxxxxxxxxxx>
In-reply-to: <CAOe9oG7b7ts2dyyYtgMt=3=5=a=Vw9vEiVk5oOWWXqNmZE=YVg@mail.gmail.com>
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130221 Thunderbird/17.0.3

Hi all, two questions about the proposal below:

1) How do we differentiate between background non-UI processes that are legitimately using 100%+ of the CPU for extended periods of time and runaway processes?

e.g. We have a photo panorama app. After a set of photos is taken, the camera app passes these photos to a separate background process to asynchronously stitch them together into a seamless panorama. This task takes, say, 5 minutes at 100% CPU, so it should run as a lowest-priority background process, regardless of which app is in focus, until it is complete. When the panorama is ready, this background process then hands the completed result back to the photo app.

There are other legitimate use cases for why we might want background non-UI processes to run at 100%+ CPU usage for extended periods. How are these differentiated from the buggy processes? From my understanding, we are proposing that app developers split any parts of their app that they need to run continually in the background, regardless of app focus, into a separate service.

2) How do we prevent developers and ourselves from coming to rely on this auto-killer, and starting to think "it's not important to fix that runaway process bug because it will be caught by the process killer, I should work on something else instead"?

Once this auto-killer is implemented, will it not change developers' behaviour? How do we prevent ourselves and app developers from starting to rely on this service instead of fixing the underlying bugs?


cheers,
John


On 04/09/13 10:49, Evan Dandrea wrote:

Hi folks,

In another discussion, James Hunt raised the possibility of
periodically checking for runaway processes on Touch, killing those
consuming 100% CPU while creating a report to be sent to
https://errors.ubuntu.com.

I've summarised the key points of that discussion here into a
proposal. The hope of this is that it gives everyone a chance to
provide input.

== Examples ==

There are a few examples of this problem biting us already.

The original bug James ran into was:
https://bugs.launchpad.net/ubuntu/+source/bluetooth-touch/+bug/1217865

Martin Pitt also raised one where two rogue system service processes
constantly used 150% CPU (i.e. 1.5 cores):
https://launchpad.net/bugs/1188404

A few weeks ago there was a nasty timing bug which caused ueventd to
use 100% CPU:
https://bugs.launchpad.net/touch-preview-images/+bug/1190792

Whoopsie also had a memory corruption bug which caused 100% CPU usage
around the same time as the ueventd bug:
https://bugs.launchpad.net/touch-preview-images/+bug/1211417

Note that this is not really about power consumption. Colin King has
done analysis of power consumption on Touch devices and the biggest
bang for the buck is ensuring that sensors are turned off when they
are not needed, not minimising CPU usage. Instead, please consider
this proposal an attempt to better ensure the stability and
performance of Touch systems out in the wild.

== Implementation ==

We will enable the sampling and reporting of high CPU usage in
background processes on Touch devices when the device is not in
developer mode.

Foreground processes will be ignored by this check. They will instead
be handled by an "application not responding" (ANR) implementation in
Mir. They will be allowed to use 100% CPU unless they block the UI
thread for an unreasonable amount of time.

With the application lifecycle work, background applications will be
suspended and get no CPU time at all, so this check will only apply to
system processes.

Each background process will be periodically sampled for its CPU
usage. If the process is using a large amount of CPU consistently
across several of these samplings, it will be killed and an apport
report will be created.
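
The sampling rule above could be sketched roughly as follows. This is
only an illustration, not the proposed implementation: the class name,
the 90% threshold and the count of five samples are all assumptions,
since the threshold is still an open question.

```python
from collections import defaultdict, deque


class RunawayDetector:
    """Flags a process whose CPU usage stays high across consecutive samples."""

    def __init__(self, threshold_percent=90.0, required_samples=5):
        # Both defaults are assumptions; the proposal leaves the threshold open.
        self.threshold_percent = threshold_percent
        self.required_samples = required_samples
        # One bounded history per PID; old samples fall off automatically.
        self._history = defaultdict(lambda: deque(maxlen=required_samples))

    def add_sample(self, pid, cpu_percent):
        """Record one sample and return True if the process looks runaway."""
        history = self._history[pid]
        history.append(cpu_percent)
        # Only flag once we have a full window and every sample is high.
        return (len(history) == self.required_samples
                and all(s >= self.threshold_percent for s in history))

    def forget(self, pid):
        """Drop the history of a process that has exited or been killed."""
        self._history.pop(pid, None)
```

A single low sample resets the verdict, so a process that is only
briefly busy would not be flagged under this sketch.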

An outstanding question is what the threshold should be for high CPU usage.

== Where will this check live? ==

It has been suggested that the task of periodically checking for
runaway processes live inside a long-running and lightweight C
process. Whoopsie was suggested as a potential candidate.

libprocps was raised as potentially helpful, but James pointed out
that CPU percentage needed to be calculated by the caller, so another
approach may prove easier.
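
To illustrate what "calculated by the caller" means, here is a minimal
sketch that derives a CPU percentage from two readings of
/proc/<pid>/stat. It does not use libprocps, and the function names are
made up for this example. The utime and stime fields (fields 14 and 15
in proc(5)) are cumulative clock ticks, so the percentage has to come
from the difference between two samples.

```python
import os


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses, so split
    # only the part after the last ')'. That part starts at field 3 (state),
    # which puts utime (field 14) at index 11 and stime (field 15) at index 12.
    rest = stat_line[stat_line.rfind(")") + 2:].split()
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """CPU usage over the interval; 100.0 means one fully used core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * elapsed_seconds)


def read_cpu_ticks(pid):
    """Read the current cumulative CPU ticks of a process (Linux only)."""
    with open("/proc/%d/stat" % pid) as stat_file:
        return parse_cpu_ticks(stat_file.read())


def clock_ticks_per_second():
    """Kernel clock tick rate used for the utime/stime fields."""
    return os.sysconf("SC_CLK_TCK")
```

With a tick rate of 100, a process that accumulates 150 ticks over one
second comes out at 150%, i.e. 1.5 cores.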

== How will we group reports of the same underlying problem? ==

https://errors.ubuntu.com will need to receive a string that
represents the problem (a signature) to which this instance of a
runaway process belongs. This lets the website group together the
instances of a problem onto a single page and increment the count for
the problem on the front page leaderboard.

Whoopsie, or whatever process holds this check, will use the ptrace
system call to generate a stack trace of the runaway process which
apport can then use to generate a crash signature:
http://bazaar.launchpad.net/~apport-hackers/apport/trunk/view/head:/apport/report.py#L1199

Martin suggested we could do three stack traces each 1 s ± <random
interval> apart, and then chop away the differing part at the top, so
that we only keep the common bit.
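
A rough sketch of that "keep the common bit" step, assuming each trace
is a list of function names ordered from the top of the stack
(innermost frame) down to the outermost frame. The function name is
hypothetical and this is not how apport currently builds signatures.

```python
def common_stack_base(traces):
    """Return the frames shared by all traces, counted from the outermost frame."""
    if not traces:
        return []
    common = []
    # Walk every trace from the outermost frame upwards, in lockstep, and
    # stop at the first level where the traces disagree.
    for frames in zip(*(reversed(trace) for trace in traces)):
        if all(frame == frames[0] for frame in frames):
            common.append(frames[0])
        else:
            break
    # Restore the original top-of-stack-first ordering.
    common.reverse()
    return common
```

For three samples of a process spinning inside the same loop, the
differing leaf calls are dropped and only the shared callers remain,
which could then be fed to the signature generation.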

Since we would be generating multiple stack traces, we cannot just
build the report and stack trace through the traditional means of
triggering apport's kernel core pipe handler by sending SIGABRT.

== Bringing the check to the Ubuntu desktop ==

It was suggested that we could also have this check on the Ubuntu
desktop, but it was quickly pointed out that great care would need to
be taken to prevent reporting when gcc or Firefox uses 100% CPU.

This would be particularly annoying since the desktop currently
presents a dialog whenever an error occurs. There are plans underway
to group errors that do not need your immediate attention
(non-application crashes, e.g. package installation failures) into a
single dialog with the next error that does require your attention
(Firefox crashing); however, a quicker solution would be to only
report these desktop runaway processes on systems that have automatic
error reporting enabled.

We could then create a blacklist of processes that are known to be
intensive but safe using the data gathered from Touch and automatic
reporting systems and eventually bring reporting of runaway processes
to all Ubuntu systems (save servers).

A whitelist was considered, but determined to not save us from
problems like the ueventd bug.

Thanks,
Evan



--
John Lea | Ubuntu Desktop User Experience Lead
Canonical  www.canonical.com | Ubuntu  www.ubuntu.com
5th Floor, Blue Fin Building, 110 Southwark Street, London SE1 0SU
Tel: +44 (0) 20 7630 2415 | Email: john.lea@xxxxxxxxxxxxx

Follow ups

Re: Catching CPU run-aways on Touch
From: Steve Langasek, 2013-09-04
Re: Catching CPU run-aways on Touch
From: Thomas Voß, 2013-09-04
Re: Catching CPU run-aways on Touch
From: Colin Ian King, 2013-09-04

References

Catching CPU run-aways on Touch
From: Evan Dandrea, 2013-09-04