ubuntu-phone team mailing list archive

Re: Catching CPU run-aways on Touch

 

On Wed, 2013-09-04 at 11:01 -0700, Steve Langasek wrote:

> On Wed, Sep 04, 2013 at 10:22:59AM -0500, Ted Gould wrote:
> > It seems to me for all of these long running services the "manager" of
> > them is Upstart.  It restarts them if they crash or do other stupid
> > things, and it knows whether they're running.  This seems roughly like
> > respawn limits[1], where they're per-task and can be configured to
> > create different results.
> 
> > Also, it seems that this should work within those limits: we should try
> > to restart the service to see if that solves the problem, but keep it on
> > a shorter leash the second time around.
> 
> > To give people something to attack more specifically, I'll say this.  We
> > should add a line to Upstart job configs that looks like this:
> 
> >         cpu limit [CPU Percentage] [seconds]
> 
> > Then we can have a small upstart-bridge-like process that watches
> > upstart for started, stopped and added jobs to ensure that they're on
> > the naughty/nice list and that they behave within those limits.
> 
> upstart already supports setting kernel ulimits for jobs; through ulimits
> you can already set "max CPU for the life of the process" and "max memory
> per process".  You can also set the realtime priority of a process.  You
> can't set a max *percentage* of CPU usage for the job, or max memory usage
> for the set of processes spawned by the job; both of these capabilities will
> arrive with cgroup support.
> 
> However, in all of the above cases we're talking about *limiting* the CPU
> usage, not *measuring* it.  If the desired semantics are to measure the
> process's CPU usage and report on it / *optionally* kill the process, I
> don't think that's a reasonable fit for upstart.  It makes sense for upstart
> to apply cgroups to processes upon request, allowing the kernel to limit the
> amount of CPU the job gets access to... but then by definition no such
> process is ever a "runaway" because it's kept on a leash, so you don't
> actually get any useful information this way about which processes are buggy
> and should be fixed.  If we care about identifying and fixing misbehaving
> processes, rather than just limiting the damage, that should be handled
> outside of upstart.


I didn't mean to say "Upstart should do it"; rather, we should put the
limit in the Upstart job configuration and use that as our basis.  That's
what I meant by an "upstart-bridge-like" process: it would get its
information from Upstart (including dynamic job creation, addition,
etc.) but do the tracking and reaction on its own.  I'd expect it to
stop/start/restart jobs through Upstart as well.
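
As a rough illustration of the tracking half of such a watcher, here is a
minimal Python sketch.  It is not an existing Upstart feature: the
"cpu limit" stanza is only the proposal from this thread, all function
names are hypothetical, and the sketch assumes the limit means "above the
given percentage continuously for the given number of seconds".  CPU time
is taken from the utime and stime fields of /proc/<pid>/stat on Linux.

```python
import os
import re

# Hypothetical stanza proposed in this thread: "cpu limit <percent> <seconds>".
_CPU_LIMIT_RE = re.compile(r"^\s*cpu\s+limit\s+(\d+(?:\.\d+)?)\s+(\d+)\s*$")


def parse_cpu_limit(line):
    """Return (percent, seconds) for a 'cpu limit' line, or None otherwise."""
    match = _CPU_LIMIT_RE.match(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def cpu_ticks_from_stat(stat_text):
    """Sum utime and stime (fields 14 and 15) from /proc/<pid>/stat content."""
    # The command name (field 2) may contain spaces or parentheses, so
    # split after the last ')' rather than on whitespace alone.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
    return int(rest[11]) + int(rest[12])


def read_cpu_ticks(pid):
    """Read the cumulative CPU ticks of a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as stat_file:
        return cpu_ticks_from_stat(stat_file.read())


def cpu_percent(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """CPU usage over one sampling interval, as a percentage of one core."""
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


def exceeds_limit(percent_samples, sample_interval, limit_percent, limit_seconds):
    """True if usage stayed above limit_percent for limit_seconds in a row."""
    time_over = 0.0
    for percent in percent_samples:
        # Any sample at or below the limit resets the run.
        time_over = time_over + sample_interval if percent > limit_percent else 0.0
        if time_over >= limit_seconds:
            return True
    return False
```

A real watcher would call read_cpu_ticks() periodically for each job's
PID, convert ticks with os.sysconf("SC_CLK_TCK"), and ask Upstart to
restart or stop the job once exceeds_limit() returns True.  How it learns
about jobs and PIDs from Upstart is left out here.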

I do think we should look at using ulimits as well, but I think that's
an aside for this thread.

Ted


