openstack team mailing list archive

Re: Scheduler issues in folsom

 

The retry scheduler is NOT meant to be a workaround for this. It sounds like
the ram filter is not working properly somehow. Have you changed the setting
for ram_allocation_ratio? It defaults to 1.5 allowing overallocation, but in
your case you may want 1.0.

I would use the following two config options to achieve what you want:
compute_fill_first_cost_fn_weight=1.0
ram_allocation_ratio=1.0

If you are using the settings above, then the scheduler should be using up the
resources on the node it schedules to until it fills up the available ram and
then moving on to the next node. If this is not occurring then you have uncovered
some sort of bug.
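
As a rough illustration of the behavior described above, here is a minimal Python sketch of a RAM filter followed by a fill-first weighting step. It is not the nova scheduler code: the Host class, the schedule() function and the node names and sizes are all made up for the example, and it assumes that a positive fill-first weight means "prefer the host with the least free RAM that still fits the request".

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for a compute node's RAM accounting."""
    name: str
    total_ram_mb: int
    used_ram_mb: int = 0

    def free_ram_mb(self, ram_allocation_ratio: float) -> float:
        # The allocation ratio scales the RAM the scheduler may hand out.
        return self.total_ram_mb * ram_allocation_ratio - self.used_ram_mb


def schedule(hosts, request_ram_mb, ram_allocation_ratio=1.0,
             fill_first_weight=1.0):
    """Pick a host for one instance, or return None if nothing fits."""
    # Filter step: drop hosts that cannot hold the request.
    candidates = [h for h in hosts
                  if h.free_ram_mb(ram_allocation_ratio) >= request_ram_mb]
    if not candidates:
        return None
    # Weighing step (assumption): cost = weight * free RAM, lowest cost wins,
    # so a positive weight packs the fullest host first. Ties break by name.
    best = min(candidates,
               key=lambda h: (fill_first_weight
                              * h.free_ram_mb(ram_allocation_ratio), h.name))
    # Claim the RAM immediately so the next decision sees the updated state.
    best.used_ram_mb += request_ram_mb
    return best


hosts = [Host("node-a", 8192), Host("node-b", 8192), Host("node-c", 16384)]
placements = [schedule(hosts, 2048).name for _ in range(6)]
print(placements)
```

With a ratio of 1.0, node-a takes four 2048 MB instances and the next requests move on to node-b. The sketch updates used_ram_mb before the next request is considered; a batch of requests that are all placed against the same stale view of free RAM would pile onto one node instead.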

Vish
On Oct 30, 2012, at 9:21 AM, Jonathan Proulx <jon@xxxxxxxxxxxxx> wrote:

> Hi All,
> 
> I'm having what I consider serious issues with the scheduler in
> Folsom.  It seems to relate to the introduction of threading in the
> scheduler.
> 
> For a number of local reasons we prefer to have instances start on the
> compute node with the least amount of free RAM that is still enough to
> satisfy the request which is the reverse of the default policy of
> scheduling on the system with the most free RAM.  I'm fairly certain
> the same behavior would be seen with that policy as well, and any
> other policy that results in a "best" choice for scheduling the next
> instance. 
> 
> We have workloads that start hundreds of instances of the same image
> and there are plans on scaling this to thousands.  What I'm seeing is
> something like this:
> 
> * user submits API request for 300 instances
> * scheduler puts them all on one node
> * retry scheduler kicks in at some point for the 276 that don't fit
> * those 276 are all scheduled on the next "best" node
> * retry cycle repeats with the 252 that don't fit there
> 
> I'm not clear exactly where the RetryScheduler inserts itself (I
> should probably read it) but the first compute node is very overloaded
> handling startup requests, which results in a fair number of instances
> entering "ERROR" state rather than rescheduling (so not all 276
> actually make it to the next round) and the whole process is painfully
> slow.  In the end we are lucky to see 50% of the requested instances
> actually make it into Active state (and then only because we increased
> scheduler_max_attempts).
> 
> Is that really how it's supposed to work?  With the introduction of
> the RetryScheduler as a fix for the scheduling race condition I think
> it is, but it is a pretty bad solution for me, unless I'm missing
> something, am I?  Wouldn't be the first time...
> 
> For now I'm working around this by using the ChanceScheduler
> (compute_scheduler_driver=nova.scheduler.chance.ChanceScheduler) so
> the scheduler threads don't pick a "best" node.  This is orders of
> magnitude faster and consistently successful in my tests.  It is not
> ideal for us as we have a small minority of compute nodes with twice
> the memory capacity of our standard nodes and would prefer to keep
> those available for some of our extra large memory flavors and we'd
> also like to minimize memory fragmentation on the standard sized nodes
> for similar reasons.
> 
> -Jon
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp


