ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

 

On Wed, Sep 4, 2013 at 5:06 PM, John Lea <john.lea@xxxxxxxxxxxxx> wrote:
> Hi All,  two questions about the proposal below:
>
> 1) How do we differentiate between background non-ui processes that are
> legitimately using 100%+ of the CPU for extended periods of time and runaway
> processes?
>
> e.g. We have a photo panorama app.  After a set of photos is taken, the
> camera app passes these photos to a separate background process to async
> stitch the photos together into a seamless panorama.  This task takes say 5
> min at 100% CPU to complete, so should run as a lowest priority
> background process regardless of which app is in focus until this task is
> complete. When the panorama is ready, this background process then hands the
> completed result back to the photo app.

That's indeed quite interesting and I don't think we are able to come
up with a meaningful heuristic for version 1. (Evan, please correct me
if I'm wrong). However, for anything user visible, ANR is the right
approach to identify apps that are using 100% CPU _and_ are
unresponsive while calculating.

>
> There are other legitimate use cases for why we might want background non-UI
> processes to run at 100%+ CPU usage for extended periods, how are these
> differentiated from the buggy processes? From my understanding we are
> proposing that app developers split any parts of their app that they need to
> run continually in the background regardless of app focus into a separate
> service.
>

Yup, that is what the lifecycle implies, and it is good software
design. We need to make that split as painless as possible and make it
straightforward to out-process complex calculations.

> 2) How do we prevent developers and ourselves from coming to rely on this
> auto-killer, and starting to think "it's not important to fix that runaway
> process bug because it will be caught by the process killer, I should work
> on something else instead"?
>
> Once this auto-killer is implemented will it not change developers'
> behaviour?  How do we prevent ourselves and app developers from starting to
> rely on this service instead of fixing the underlying bugs?
>

Well, we are still reporting issues and we can keep that as an option,
even in production. Apart from that: Yes, we are making the system
more robust against outliers in terms of resource usage. From a user's
perspective, the result is the same either way: a bug-free service, or
a system that handles outliers gracefully.

Cheers,

  Thomas

> cheers,
> John
>
>
>
> On 04/09/13 10:49, Evan Dandrea wrote:
>>
>> Hi folks,
>>
>> In another discussion, James Hunt raised the possibility of
>> periodically checking for runaway processes on Touch, killing those
>> consuming 100% CPU while creating a report to be sent to
>> https://errors.ubuntu.com.
>>
>> I've summarised the key points of that discussion here into a
>> proposal. The hope of this is that it gives everyone a chance to
>> provide input.
>>
>> == Examples ==
>>
>> There are a few examples of this problem biting us already.
>>
>> The original bug James ran into was:
>> https://bugs.launchpad.net/ubuntu/+source/bluetooth-touch/+bug/1217865
>>
>> Martin Pitt also raised one where two rogue system service processes
>> constantly used 150% CPU (i.e. 1.5 cores):
>> https://launchpad.net/bugs/1188404
>>
>> A few weeks ago there was a nasty timing bug which caused ueventd to
>> use 100% CPU:
>> https://bugs.launchpad.net/touch-preview-images/+bug/1190792
>>
>> Whoopsie also had a memory corruption bug which caused 100% CPU usage
>> around the same time as the ueventd bug:
>> https://bugs.launchpad.net/touch-preview-images/+bug/1211417
>>
>> Note that this is not really about power consumption. Colin King has
>> done analysis of power consumption on Touch devices and the biggest
>> bang for the buck is ensuring that sensors are turned off when they
>> are not needed, not minimising CPU usage. Instead, please consider
>> this proposal an attempt to better ensure the stability and
>> performance of Touch systems out in the wild.
>>
>> == Implementation ==
>>
>> We will enable the sampling and reporting of high CPU usage in
>> background processes on Touch devices when the device is not in
>> developer mode.
>>
>> Foreground processes will be ignored by this check. They will instead
>> be handled by an "application not responding" (ANR) implementation in
>> Mir. They will be allowed to use 100% CPU unless they block the UI
>> thread for an unreasonable amount of time.
>>
>> With the application lifecycle work, background applications will be
>> suspended and get no CPU time at all, so this check will only apply to
>> system processes.
>>
>> Each background process will be periodically sampled for its CPU
>> usage. If the process is using a large amount of CPU consistently
>> across several of these samplings, it will be killed and an apport
>> report will be created.
>>
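
A minimal sketch of the sampling rule described above, assuming a
threshold of 90% and three consecutive high samples. Both numbers are
placeholders (the threshold is still an open question), and the
RunawayDetector class is a hypothetical name used only for
illustration. It only decides which PIDs qualify; killing and
reporting are left out.

```python
from collections import defaultdict

THRESHOLD_PERCENT = 90.0  # assumed value; the proposal leaves this open
REQUIRED_SAMPLES = 3      # assumed number of consecutive high samples


class RunawayDetector:
    """Flag a PID once it stays above the threshold N samples in a row."""

    def __init__(self, threshold=THRESHOLD_PERCENT, required=REQUIRED_SAMPLES):
        self.threshold = threshold
        self.required = required
        self.streaks = defaultdict(int)

    def observe(self, samples):
        """Record one sampling round; `samples` maps pid -> CPU percent.

        Returns the sorted PIDs that have now been high for `required`
        consecutive rounds.
        """
        flagged = []
        for pid, percent in samples.items():
            if percent >= self.threshold:
                self.streaks[pid] += 1
            else:
                # A single quiet sample resets the streak.
                self.streaks[pid] = 0
            if self.streaks[pid] >= self.required:
                flagged.append(pid)
        # Forget processes that have exited since the last round.
        for pid in list(self.streaks):
            if pid not in samples:
                del self.streaks[pid]
        return sorted(flagged)
```

Requiring a run of high samples rather than a single one is what
keeps short bursts of legitimate work from being flagged.
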
>> An outstanding question is what the threshold should be for high CPU
>> usage.
>>
>> == Where will this check live? ==
>>
>> It has been suggested that the task of periodically checking for
>> runaway processes live inside a long-running and lightweight C
>> process. Whoopsie was suggested as a potential candidate.
>>
>> libprocps was raised as potentially helpful, but James pointed out
>> that CPU percentage needed to be calculated by the caller, so another
>> approach may prove easier.
>>
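
One possible way for the caller to compute the percentage itself,
sketched in Python for brevity rather than C: read utime and stime
(fields 14 and 15 of /proc/<pid>/stat) twice and divide the difference
by the elapsed time. The function names are hypothetical, and this
assumes a Linux /proc layout.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) is in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def cpu_percent(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """CPU usage over the interval; above 100 means more than one core."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    used_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * used_seconds / elapsed_seconds


def read_cpu_ticks(pid):
    """Read the current tick count for a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_cpu_ticks(f.read())


def sample_cpu_percent(pid, interval=1.0):
    """Measure one process over `interval` seconds (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    start = time.monotonic()
    before = read_cpu_ticks(pid)
    time.sleep(interval)
    after = read_cpu_ticks(pid)
    elapsed = time.monotonic() - start
    return cpu_percent(before, after, elapsed, ticks_per_second)
```

A process that accumulates 150 ticks in one second at 100 ticks per
second comes out at 150%, which matches the 1.5-core case mentioned in
the examples.
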
>> == How will we group reports of the same underlying problem? ==
>>
>> https://errors.ubuntu.com will need to receive a string that
>> represents the problem (a signature) to which this instance of a
>> runaway process belongs. This lets the website group together the
>> instances of a problem onto a single page and increment the count for
>> the problem on the front page leaderboard.
>>
>> Whoopsie, or whatever process holds this check, will use the ptrace
>> system call to generate a stack trace of the runaway process which
>> apport can then use to generate a crash signature:
>>
>> http://bazaar.launchpad.net/~apport-hackers/apport/trunk/view/head:/apport/report.py#L1199
>>
>> Martin suggested we could do three stack traces each 1 s ± <random
>> interval> apart, and then chop away the differing part at the top, so
>> that we only keep the common bit.
>>
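
A rough sketch of the "keep the common bit" step, assuming each stack
trace has already been reduced to a list of function names with the
innermost frame (top of stack) first. The helper name and the frame
names in the usage note are made up for illustration; how the traces
are captured is not covered here.

```python
def common_outer_frames(traces):
    """Keep the frames shared by every trace, counted from the outermost.

    Each trace is a list of function names, innermost frame first.
    The result uses the same innermost-first order.
    """
    if not traces:
        return []
    shared = []
    # Walk from the outermost frame inwards until the traces disagree
    # (zip also stops at the end of the shortest trace).
    for frames in zip(*(reversed(trace) for trace in traces)):
        if any(frame != frames[0] for frame in frames):
            break
        shared.append(frames[0])
    shared.reverse()  # restore innermost-first order
    return shared
```

For three samples ending in read, write/flush and nothing on top of
poll_loop -> run -> main, this keeps ["poll_loop", "run", "main"], so
all three samples would map to the same signature input.
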
>> Since we would be generating multiple stack traces, we cannot just
>> build the report and stack trace through the traditional means of
>> triggering apport's kernel core pipe handler by sending SIGABRT.
>>
>> == Bringing the check to the Ubuntu desktop ==
>>
>> It was suggested that we could also have this check on the Ubuntu
>> desktop, but it was quickly pointed out that great care would need to
>> be taken to prevent reporting when gcc or Firefox uses 100% CPU.
>>
>> This would be particularly annoying since the desktop currently
>> presents a dialog whenever an error occurs. There are plans underway
>> to group errors that do not need your immediate attention
>> (non-application crashes, e.g. package installation failures) into a
>> single dialog with the next error that does require your attention
>> (Firefox crashing); however, a quicker solution would be to only
>> report these desktop runaway processes on systems that have automatic
>> error reporting enabled.
>>
>> We could then create a blacklist of processes that are known to be
>> intensive but safe using the data gathered from Touch and automatic
>> reporting systems and eventually bring reporting of runaway processes
>> to all Ubuntu systems (save servers).
>>
>> A whitelist was considered, but it was determined that it would not
>> save us from problems like the ueventd bug.
>>
>> Thanks,
>> Evan
>>
>
>
> --
> John Lea | Ubuntu Desktop User Experience Lead
> Canonical  www.canonical.com | Ubuntu  www.ubuntu.com
> 5th Floor, Blue Fin Building, 110 Southwark Street, London SE1 0SU
> Tel: +44 (0) 20 7630 2415 | Email: john.lea@xxxxxxxxxxxxx
>
>
>
> --
> Mailing list: https://launchpad.net/~ubuntu-phone
> Post to     : ubuntu-phone@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~ubuntu-phone
> More help   : https://help.launchpad.net/ListHelp

