openstack team mailing list archive
Message #17989
Scheduler issues in Folsom
Hi All,
I'm having what I consider serious issues with the scheduler in
Folsom. They seem to relate to the introduction of threading in the
scheduler.
For a number of local reasons we prefer to have instances start on the
compute node with the least amount of free RAM that is still enough to
satisfy the request, which is the reverse of the default policy of
scheduling on the system with the most free RAM. I'm fairly certain
the same behavior would be seen with that policy as well, and with any
other policy that results in a "best" choice for scheduling the next
instance.
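To make the two policies concrete, here is a minimal sketch in Python.
It is not nova's scheduler code: the host tuples, node names and
function names are made up for illustration, and only the selection
rule (least free RAM that still fits vs. most free RAM) comes from the
description above.

```python
# Hypothetical host records: (name, free_ram_mb). Not nova's data model.
hosts = [("node-a", 4096), ("node-b", 16384), ("node-c", 1024)]


def pick_fill_first(hosts, requested_mb):
    """Pick the host with the LEAST free RAM that still fits the request."""
    candidates = [h for h in hosts if h[1] >= requested_mb]
    if not candidates:
        return None  # nothing can satisfy the request
    return min(candidates, key=lambda h: h[1])


def pick_spread_first(hosts, requested_mb):
    """Pick the host with the MOST free RAM (the default-style policy)."""
    candidates = [h for h in hosts if h[1] >= requested_mb]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h[1])


print(pick_fill_first(hosts, 2048))    # ('node-a', 4096)
print(pick_spread_first(hosts, 2048))  # ('node-b', 16384)
```

Either way there is exactly one "best" host for a given view of the
cluster, which is why I'd expect both policies to behave the same when
many requests are evaluated against the same view.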
We have workloads that start hundreds of instances of the same image
and there are plans to scale this to thousands. What I'm seeing is
something like this:
* user submits API request for 300 instances
* scheduler puts them all on one node
* retry schedule kicks in at some point for the 276 that don't fit
* those 276 are all scheduled on the next "best" node
* retry cycle repeats with the 252 that don't fit there
I'm not clear exactly where the RetryScheduler inserts itself (I
should probably read it), but the first compute node is very overloaded
handling startup requests, which results in a fair number of instances
entering "ERROR" state rather than rescheduling (so not all 276
actually make it to the next round), and the whole process is painfully
slow. In the end we are lucky to see 50% of the requested instances
actually make it into Active state (and then only because we increased
scheduler_max_attempts).
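The retry cycle in the list above can be sketched as a toy model. This
is only arithmetic, not the scheduler's actual logic: the 24 slots per
node is inferred from 300 - 276, while the node count and the attempt
limit passed in below are assumptions chosen for illustration. It also
ignores the instances that fall into "ERROR" on the overloaded node, so
real results would be worse.

```python
def simulate_retry_cascade(requested, slots_per_node, nodes, max_attempts):
    """Toy model: on every attempt, ALL pending requests are sent to the
    same 'best' node, only slots_per_node of them fit, and the remainder
    is retried against the next 'best' node."""
    pending = requested
    active = 0
    rounds = []
    for attempt in range(1, max_attempts + 1):
        if pending == 0 or attempt > nodes:
            break  # nothing left to place, or no untried node left
        placed = min(pending, slots_per_node)
        active += placed
        pending -= placed
        rounds.append((attempt, placed, pending))
    return active, pending, rounds


# Assumed values: 20 nodes and 3 attempts are illustrative only.
active, failed, rounds = simulate_retry_cascade(300, 24, 20, 3)
for attempt, placed, left in rounds:
    print(f"attempt {attempt}: placed {placed}, {left} left to retry")
print(f"active={active}, never placed={failed}")
```

Under this model, placing all 300 instances takes 13 attempts (12 full
nodes of 24 plus one of 12), so any smaller attempt limit leaves
requests unplaced even before counting the "ERROR" cases.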
Is that really how it's supposed to work? With the introduction of
the RetryScheduler as a fix for the scheduling race condition I think
it is, but it is a pretty bad solution for me, unless I'm missing
something. Am I? Wouldn't be the first time...
For now I'm working around this by using the ChanceScheduler
(compute_scheduler_driver=nova.scheduler.chance.ChanceScheduler) so
the scheduler threads don't pick a "best" node. This is orders of
magnitude faster and consistently successful in my tests. It is not
ideal for us, as we have a small minority of compute nodes with twice
the memory capacity of our standard nodes and would prefer to keep
those available for some of our extra-large-memory flavors. We'd
also like to minimize memory fragmentation on the standard-sized nodes
for similar reasons.
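As a rough illustration of why a random pick avoids the pile-up, here
is a toy model in Python. It is not the ChanceScheduler's code: the
function name, the fixed seed, the 20-node cluster and the 24 slots per
node are all assumptions for the sketch. Each request simply picks a
node uniformly at random, and we count how many land on a node beyond
its capacity.

```python
import random
from collections import Counter


def simulate_random_placement(requested, slots_per_node, nodes, seed=0):
    """Toy model: every request independently picks a random node index.
    Returns the per-node request counts and the number of requests that
    exceeded a node's capacity (and so would need a retry)."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    landed = Counter(rng.randrange(nodes) for _ in range(requested))
    overflow = sum(max(0, n - slots_per_node) for n in landed.values())
    return landed, overflow


# Assumed cluster: 20 nodes with 24 slots each (illustrative only).
landed, overflow = simulate_random_placement(300, 24, 20)
print("busiest node received", max(landed.values()), "requests")
print("requests needing a retry:", overflow)
```

With 300 requests over 20 nodes the average is 15 per node, so most
requests fit on the first try, whereas sending everything to a single
"best" node leaves 276 to retry. The cost is the one described above:
a random pick knows nothing about node size or fragmentation.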
-Jon