yahoo-eng-team team mailing list archive

[Bug 1341420] [NEW] gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Robert Collins <1341420@xxxxxxxxxxxxxxxxxx>
Date: Mon, 14 Jul 2014 04:15:03 -0000
Reply-to: Bug 1341420 <1341420@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

There is a race between the scheduler in select_destinations, which
selects a set of hosts, and the nova compute manager, which claims
resources on those hosts when building the instance. The race is
particularly noticeable with Ironic, where every request will consume a
full host, but can turn up on libvirt etc too. Multiple schedulers will
likely exacerbate this too unless they are in a version of python with
randomised dictionary ordering, in which case they will make it better
:).
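
A toy model of the gap, for illustration only. This is not nova code:
Host, select_destination and run are made-up names, and the "scheduler"
here simply picks the host with the most free slots. It shows that when
several requests are scheduled before any claim lands, they all get the
same host and all but one fail.

```python
# Toy model of the select-then-claim gap; names are hypothetical, not nova's.

class Host:
    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots

    def claim(self):
        """Claim one slot, as the compute manager would at build time."""
        if self.free_slots < 1:
            raise RuntimeError("no room left on %s" % self.name)
        self.free_slots -= 1


def select_destination(hosts):
    """Pick the host with the most free slots; ties go to the first host."""
    candidates = [h for h in hosts if h.free_slots > 0]
    if not candidates:
        raise RuntimeError("no valid host")
    return max(candidates, key=lambda h: h.free_slots)


def run(hosts, num_requests, claim_immediately):
    """Return how many requests fail their claim.

    With claim_immediately=False every request is scheduled before any
    claim lands, which models requests arriving inside the gap.
    """
    failures = 0
    pending = []
    for _ in range(num_requests):
        host = select_destination(hosts)
        if claim_immediately:
            host.claim()
        else:
            pending.append(host)
    for host in pending:
        try:
            host.claim()
        except RuntimeError:
            failures += 1
    return failures
```

With three hosts that each have one slot left, three requests scheduled
inside the gap all pick the first host, so two of the three claims fail;
claiming at selection time lets all three succeed.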


I've put https://review.openstack.org/106677 up to remove a comment which comes from before we introduced this race.

One mitigating factor is that the filter scheduler's _schedule method
attempts to randomly select hosts to avoid returning the same host in
repeated requests, but the default minimum set it selects from is size
1 - so when heat requests a single instance, the same candidate is
chosen every time. Setting that number higher can avoid all concurrent
requests hitting the same host, but it will still be a race, and still
likely to fail fairly hard in near-capacity situations (e.g. deploying
all machines in a cluster with Ironic and Heat).
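
A minimal sketch of that subset selection, assuming the hosts are
already sorted best-first. pick_host and its parameters are
illustrative names, not the actual _schedule code; I believe the
matching nova option is scheduler_host_subset_size, but treat that name
as an assumption and check your release.

```python
import random


def pick_host(weighed_hosts, subset_size=1, rng=random):
    """Choose randomly among the best `subset_size` hosts.

    `weighed_hosts` must already be sorted best-first. Hypothetical
    helper for illustration; not nova's implementation.
    """
    # Clamp the subset to at least 1 and at most the number of hosts.
    subset_size = max(1, min(subset_size, len(weighed_hosts)))
    return rng.choice(weighed_hosts[:subset_size])
```

With subset_size=1 the slice has a single element, so every concurrent
request gets the same host. A larger subset spreads the picks out, but
two requests can still land on the same host.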

Folk wanting to reproduce this: take a decent size cloud - e.g. 5 or 10
hypervisor hosts (KVM is fine). Deploy up to 1 VM left of capacity on
each hypervisor. Then deploy a bunch of VMs one at a time but very close
together - e.g. use the python API to get cached keystone credentials,
and boot 5 in a loop.
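
A rough sketch of such a loop. boot_burst is a hypothetical helper, and
boot is any callable that takes an instance name and issues one boot
request, so the client setup (and the credential caching) stays outside
the loop.

```python
import time


def boot_burst(boot, count=5, prefix="race-test"):
    """Call boot(name) `count` times back to back.

    Returns a list of (name, seconds_taken) pairs. `boot` is supplied by
    the caller; nothing here talks to nova directly.
    """
    results = []
    for i in range(count):
        name = "%s-%d" % (prefix, i)
        started = time.time()
        boot(name)
        results.append((name, time.time() - started))
    return results
```

With python-novaclient, boot could be something like
lambda name: nova.servers.create(name, image, flavor), where nova, image
and flavor are created once beforehand; that wiring is an assumption
about your environment, not something tested here.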

If using Ironic you will want https://review.openstack.org/106676 to let
you see which host is being returned from the selection.

Possible fixes:
 - have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well
 - reinstate actioning claims into the scheduler, allowing the audit to correct any claimed-but-not-started resource counts asynchronously
 - special case the retry behaviour if there are lots of resources available elsewhere in the cluster.
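
To make the first option concrete, here is a minimal sketch of tracking
selection counts between refreshes. SelectionTracker is a hypothetical
class, not a proposed patch; it assumes capacity can be expressed as a
simple free-slot count per host.

```python
from collections import Counter


class SelectionTracker:
    """Remember how often each host was handed out since the last refresh."""

    def __init__(self):
        self._picked = Counter()

    def refresh(self):
        """Forget the counts once fresh usage data has arrived from compute."""
        self._picked.clear()

    def select(self, free_slots):
        """Pick the host with the most free slots net of recent selections.

        free_slots maps host name -> free slots as last reported.
        """
        def effective(name):
            # Subtract selections the reported usage does not reflect yet.
            return free_slots[name] - self._picked[name]

        candidates = [n for n in sorted(free_slots) if effective(n) > 0]
        if not candidates:
            raise RuntimeError("no valid host")
        choice = max(candidates, key=effective)
        self._picked[choice] += 1
        return choice
```

This only helps within a single scheduler process: two schedulers each
keep their own counts, so the race remains between them.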

Stats wise, I just tested a 29 instance deployment with ironic and a
heat stack, with 45 machines to deploy onto (so 45 hosts in the
scheduler set) and 4 failed with this race - which means they rescheduled
and failed 3 times each - or 12 cases of scheduler racing *at minimum*.

background chat

15:43 < lifeless> mikal: around? I need to sanity check something
15:44 < lifeless> ulp, nope, am sure of it. filing a bug.
15:45 < mikal> lifeless: ok
15:46 < lifeless> mikal: oh, you're here, I will run it past you :)
15:46 < lifeless> mikal: if you have ~5m
15:46 < mikal> Sure
15:46 < lifeless> so, symptoms
15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
15:47 < lifeless> this is with Ironic
15:47 < mikal> Sure
15:47 < lifeless> the failure on all the instances is the retry-three-times failure-of-death
15:47 < lifeless> what I believe is happening is this
15:48 < lifeless> the scheduler is allocating the same weighed list of hosts for requests that happen close enough together
15:49 < lifeless> and I believe its able to do that because the target hosts (from select_destinations) need to actually hit the compute node manager and have 
15:49 < lifeless>             with rt.instance_claim(context, instance, limits):    
15:49 < lifeless> happen in _build_and_run_instance
15:49 < lifeless> before the resource usage is assigned
15:49 < mikal> Is heat making 45 separate requests to the nova API?
15:49 < lifeless> eys
15:49 < lifeless> yes
15:49 < lifeless> thats the key difference
15:50 < lifeless> same flavour, same image
15:50 < openstackgerrit> Sam Morrison proposed a change to openstack/nova: Remove cell api overrides for lock and unlock  https://review.openstack.org/89487
15:50 < mikal> And you have enough quota for these instances, right?
15:50 < lifeless> yes
15:51 < mikal> I'd have to dig deeper to have an answer, but it sure does seem worth filing a bug for
15:51 < lifeless> my theory is that there is enough time between select_destinations in the conductor, and _build_and_run_instance in compute for another request to come in the front door and be scheduled to the same host
15:51 < mikal> That seems possible to me
15:52 < lifeless> I have no idea right now about how to fix it (other than to have the resources provisionally allocated by the scheduler before it sends a reply), but I am guessing that might be contentious
15:52 < mikal> I can't instantly think of a fix though -- we've avoided queue like behaviour for scheduling
15:52 < mikal> How big is the cluster compared with 45 instances?
15:52 < mikal> Is it approximately the same size as that?
15:52 < lifeless> (by provisionally allocated, I mean 'claim them and let the audit in 60 seconds fix it up if they are not actually used')
15:53 < lifeless> sorry, not sure what you mean by that last question
15:53 < mikal> So, if you have 45 ironic instances to schedule, and 45 identical machines to do it, then the probability of picking the same machine more than once to schedule on is very high
15:53 < mikal> Whereas if you had 500 machines, it would be low
15:53 < lifeless> oh yes, all the hardware is homogeneous
15:54 < lifeless> we believe this is common in clouds :)
15:54 < mikal> And the cluster is sized at approximately 45 machines?
15:54 < lifeless> the cluster is 46 machines but one is down for maintenance
15:54 < lifeless> so 45 machines available to schedule onto.
15:54 < mikal> Its the size of the cluster compared to the size of the set of instances which I'm most interested in
15:54 < lifeless> However - and this is the interesting thing
15:54 < lifeless> I tried a heat stack of 20 machines.
15:54 < lifeless> same symptoms
15:54 < mikal> Yeah, that's like the worst possible case for this algorithm
15:54 < lifeless> about 30% failed due to scheduler retries.
15:54 < mikal> Hmmm
15:54 < mikal> That is unexpected to me
15:55 < lifeless> that is when I dived into the code.
15:55 < lifeless> the patch I pushed above will make it possible to see if my theory is correct
15:55 < mikal> you were going to file a bug, right?
15:56 < lifeless> I have the form open to file one with tasks on ironic and nova
15:56 < mikal> I vote you do that thing
15:56 < lifeless> seconded
15:56 < lifeless> I might copy this transcript in as well
15:57 < mikal> Works for me
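
The provisional allocation idea from the chat ("claim them and let the
audit in 60 seconds fix it up") could look roughly like this.
ProvisionalClaims and all of its methods are hypothetical, capacity is
reduced to a slot count per host, and time is passed in explicitly to
keep the sketch deterministic.

```python
class ProvisionalClaims:
    """Scheduler-side claims that expire unless compute confirms them."""

    def __init__(self, capacity, ttl=60.0):
        self.capacity = dict(capacity)            # host -> total slots
        self.ttl = ttl                            # seconds until the audit reclaims
        self.confirmed = {h: 0 for h in capacity}
        self.pending = {}                         # request id -> (host, expiry)

    def free(self, host):
        """Slots left after confirmed and still-pending claims."""
        held = sum(1 for h, _ in self.pending.values() if h == host)
        return self.capacity[host] - self.confirmed[host] - held

    def select(self, request_id, now):
        """Pick a host and claim a slot on it before replying."""
        for host in sorted(self.capacity):
            if self.free(host) > 0:
                self.pending[request_id] = (host, now + self.ttl)
                return host
        raise RuntimeError("no valid host")

    def confirm(self, request_id):
        """Compute really started the instance; make the claim permanent."""
        host, _ = self.pending.pop(request_id)
        self.confirmed[host] += 1

    def audit(self, now):
        """Drop claims that were never confirmed; return their request ids."""
        expired = [r for r, (_, exp) in self.pending.items() if exp <= now]
        for r in expired:
            del self.pending[r]
        return expired
```

Because the slot is held from the moment select returns, a second
request arriving inside the gap sees the host as full and goes
elsewhere; the audit only has to clean up claims that never turned into
instances.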

** Affects: ironic
     Importance: Undecided
         Status: New

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: ironic scheduler

** Also affects: nova
   Importance: Undecided
       Status: New

** Tags added: ironic scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1341420

Title:
  gap between scheduler selection and claim causes spurious failures
  when the instance is the last one to fit

Status in OpenStack Bare Metal Provisioning Service (Ironic):
  New
Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/1341420/+subscriptions

Follow ups

[Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: Alex Schultz, 2017-09-22
[Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: Dan Smith, 2017-02-03
[Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: Vasyl Saienko, 2017-02-03
[Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: Sylvain Bauza, 2016-04-18
[Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: James Slagle, 2016-03-08
[Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: Michael Davies, 2015-06-02
[Bug 1341420] [NEW] gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit
From: Robert Collins, 2014-07-14
