openstack team mailing list archive

Re: Scheduler issues in folsom

 

Hi Jonathan,

If I understand correctly, that bug is about multiple scheduler
instances (processes) doing scheduling at the same time.  When a compute
node finds itself unable to fulfil a create_instance request, it resends
the request back to the scheduler (max_retry exists to avoid endless
retries).  From your description, I only see one scheduler.  And you are
right: even if the memory accounting has an issue, cpu_allocation_ratio
should stop the scheduler from placing more vCPUs than pCPUs on a node.
What OpenStack package are you using?
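For reference, the check I'd expect is roughly the following (a simplified sketch of a CoreFilter-style test, not the actual nova source; the function name and arguments are illustrative):

```python
def core_filter_passes(vcpus_total, vcpus_used, requested_vcpus,
                       cpu_allocation_ratio=1.0):
    """Sketch of a CoreFilter-style check: a host passes only if the
    requested vCPUs fit under total pCPUs * cpu_allocation_ratio."""
    limit = vcpus_total * cpu_allocation_ratio
    return vcpus_used + requested_vcpus <= limit

# With 24 pCPUs and ratio 1.0, a 25th single-vCPU instance should be
# rejected, so used_now should never climb past 24:
print(core_filter_passes(24, 23, 1))  # True
print(core_filter_passes(24, 24, 1))  # False
```

If the scheduler is stale on vcpus_used between placements in one burst of requests, this check passes repeatedly for the same host, which would match what you observe.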

On Wed, Oct 31, 2012 at 11:41 PM, Jonathan Proulx <jon@xxxxxxxxxxxxx> wrote:
> Hi All
>
> While the RetryScheduler may not have been designed specifically to
> fix this issue, https://bugs.launchpad.net/nova/+bug/1011852 suggests
> that it is meant to fix it, assuming "it" is a scheduler race
> condition, which is my suspicion.
>
> This is my current scheduler config which gives the failure mode I describe:
>
> scheduler_available_filters=nova.scheduler.filters.standard_filters
> scheduler_default_filters=AvailabilityZoneFilter,RamFilter,CoreFilter,ComputeFilter,RetryFilter
> scheduler_max_attempts=30
> least_cost_functions=nova.scheduler.least_cost.compute_fill_first_cost_fn
> compute_fill_first_cost_fn_weight=1.0
> cpu_allocation_ratio=1.0
> ram_allocation_ratio=1.0
>
> I'm running the scheduler and API server on a single controller host,
> and it's pretty consistent: it schedules over a hundred instances per
> node at first, then iteratively reschedules them elsewhere, whether
> presented with a single API request to start many instances (using
> euca2ools) or a shell loop around nova boot that generates one API
> request per server.
>
> The cpu_allocation_ratio should limit the scheduler to 24 instances
> per compute node regardless of how it's calculating memory. So while
> I talked a lot about memory allocation as a motivation, CPU is more
> frequently the actual limiting factor in my deployment, and it
> certainly should be here.
>
> And yet after attempting to launch 200 m1.tiny instances:
>
> root@nimbus-0:~# nova-manage service describe_resource nova-23
> 2012-10-31 11:17:56
> HOST                              PROJECT     cpu mem(mb)     hdd
> nova-23         (total)                        24   48295     882
> nova-23         (used_now)                    107   56832      30
> nova-23         (used_max)                    107   56320      30
> nova-23                  98333a1a28e746fa8c629c83a818ad57     106   54272       0
> nova-23                  3008a142e9524f7295b06ea811908f93       1    2048      30
>
> Eventually those bleed off to other systems, though not entirely:
>
> 2012-10-31 11:29:41
> HOST                              PROJECT     cpu mem(mb)     hdd
> nova-23         (total)                        24   48295     882
> nova-23         (used_now)                     43   24064      30
> nova-23         (used_max)                     43   23552      30
> nova-23                  98333a1a28e746fa8c629c83a818ad57      42   21504       0
> nova-23                  3008a142e9524f7295b06ea811908f93       1    2048      30
>
> At this point, 12 minutes later, of the 200 instances 168 are active,
> 22 are errored, and 10 are still "building".  Notably, only 23 actual
> VMs are running on "nova-23":
>
> root@nova-23:~# virsh list|grep instance |wc -l
> 23
>
> So that's what I see; perhaps my assumptions about why I'm seeing it
> are incorrect.
>
> Thanks,
> -Jon



-- 
Regards
Huang Zhiteng
