[Bug 1438238] Re: Several concurrent scheduling requests for CPU pinning may fail due to racy host_state handling
** Also affects: nova/kilo
Importance: Undecided
Status: New
** Changed in: nova/kilo
Milestone: None => kilo-rc2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1438238
Title:
Several concurrent scheduling requests for CPU pinning may fail due to
racy host_state handling
Status in OpenStack Compute (Nova):
Fix Committed
Status in OpenStack Compute (nova) kilo series:
New
Bug description:
The issue happens when multiple scheduling attempts that request CPU pinning are done in parallel.
2015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return func(*args, **kwargs)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher
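The CPUPinningInvalid at the bottom of the traceback is raised when a request tries to pin host CPUs that are already recorded as pinned. Below is a simplified, hypothetical model of that kind of check; the NUMACellSketch class and its fields are illustrative only, not the real nova/objects/numa.py code:

    # Simplified, hypothetical model of the pinning check that produces the
    # error above; names and fields are illustrative, not Nova's actual code.

    class CPUPinningInvalid(Exception):
        pass


    class NUMACellSketch(object):
        def __init__(self, cpuset):
            self.cpuset = set(cpuset)    # host CPUs belonging to this NUMA cell
            self.pinned_cpus = set()     # host CPUs already claimed by instances

        def pin_cpus(self, cpus):
            """Claim the given host CPUs, refusing any that are already pinned."""
            cpus = set(cpus)
            already_pinned = cpus & self.pinned_cpus
            if already_pinned:
                raise CPUPinningInvalid(
                    "Cannot pin/unpin cpus %s from the following pinned set %s"
                    % (sorted(already_pinned), sorted(self.pinned_cpus)))
            self.pinned_cpus |= cpus


    cell = NUMACellSketch(cpuset=range(3, 10))
    cell.pin_cpus([3, 4, 5, 6, 7, 8, 9])  # first request claims these CPUs
    cell.pin_cpus([4])                    # CPU 4 is already pinned: raises
                                          # CPUPinningInvalid with the same
                                          # message as the log above

The last line reproduces the message from the log: CPU 4 is requested again while it already sits in the pinned set [3, 4, 5, 6, 7, 8, 9].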
What is likely happening is the following (a sketch modelling this interleaving appears after the list):
* nova-scheduler is handling several RPC calls to select_destinations
at the same time, in multiple greenthreads
* greenthread 1 runs the NUMATopologyFilter and selects a CPU on a
particular compute node, updating host_state.instance_numa_topology
* greenthread 1 then blocks for some reason
* greenthread 2 runs the NUMATopologyFilter and selects the same CPU
on the same compute node, updating host_state.instance_numa_topology.
This would be a problem even if a different CPU were selected, since it
overwrites the instance_numa_topology chosen by greenthread 1.
* greenthread 2 then blocks for some reason
* greenthread 1 is scheduled again and calls consume_from_instance, which
consumes the NUMA resources based on what is in
host_state.instance_numa_topology
* greenthread 1 completes the scheduling operation
* greenthread 2 is scheduled again and calls consume_from_instance, which
consumes the NUMA resources based on what is in
host_state.instance_numa_topology; since those resources were already
consumed by greenthread 1, we get the exception above
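The following self-contained sketch models that interleaving. FakeHostState and its methods are stand-ins for the shared host state, the NUMATopologyFilter and consume_from_instance; they are not the real Nova classes, and no real greenthreads are needed to show the effect:

    # Minimal model of the race described above. FakeHostState, run_numa_filter
    # and consume_from_instance are illustrative stand-ins, not Nova's classes.

    class CPUPinningInvalid(Exception):
        pass


    class FakeHostState(object):
        def __init__(self, free_cpus):
            self.free_cpus = set(free_cpus)      # CPUs still available for pinning
            self.pinned_cpus = set()             # CPUs already consumed
            self.instance_numa_topology = None   # the filter's last decision
                                                 # (shared between requests!)

        def run_numa_filter(self):
            # Filter analogue: pick a free CPU and record the choice on the
            # *shared* host state rather than on per-request state.
            cpu = min(self.free_cpus)
            self.instance_numa_topology = {'pinned_cpu': cpu}
            return True

        def consume_from_instance(self):
            # Consume analogue: apply whatever decision is currently stored
            # on the shared host state.
            cpu = self.instance_numa_topology['pinned_cpu']
            if cpu in self.pinned_cpus:
                raise CPUPinningInvalid(
                    "Cannot pin/unpin cpus [%d] from the following pinned set %s"
                    % (cpu, sorted(self.pinned_cpus)))
            self.free_cpus.remove(cpu)
            self.pinned_cpus.add(cpu)


    host = FakeHostState(free_cpus=[4, 5, 6])

    host.run_numa_filter()         # greenthread 1 picks CPU 4
    host.run_numa_filter()         # greenthread 2 also picks CPU 4 (nothing
                                   # has been consumed yet, so the filter
                                   # still sees it as free) and overwrites
                                   # the shared decision
    host.consume_from_instance()   # greenthread 1 consumes CPU 4: succeeds
    host.consume_from_instance()   # greenthread 2 replays the same decision:
                                   # raises CPUPinningInvalid

The essential problem the sketch exposes is that the filter's decision is stored on state shared by all concurrent requests, so a consume step can act on a decision that another request has already consumed.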
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1438238/+subscriptions