[Bug 1438238] Re: Several concurrent scheduling requests for CPU pinning may fail due to racy host_state handling
** Also affects: nova/kilo
Importance: Undecided
Status: New
** Changed in: nova/kilo
Milestone: None => kilo-rc2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1438238
Title:
Several concurrent scheduling requests for CPU pinning may fail due to
racy host_state handling
Status in OpenStack Compute (Nova):
Fix Committed
Status in OpenStack Compute (nova) kilo series:
New
Bug description:
The issue happens when multiple scheduling attempts that request CPU pinning are done in parallel.
2015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return func(*args, **kwargs)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher
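The CPUPinningInvalid at the bottom of the traceback is raised when a request tries to pin host CPUs that are already recorded as pinned. Below is a simplified, hypothetical model of that kind of check; the NUMACellSketch class and its fields are illustrative only, not the real nova/objects/numa.py code:

    # Simplified, hypothetical model of the pinning check that produces the
    # error above; names and fields are illustrative, not Nova's actual code.

    class CPUPinningInvalid(Exception):
        pass


    class NUMACellSketch(object):
        def __init__(self, cpuset):
            self.cpuset = set(cpuset)    # host CPUs belonging to this NUMA cell
            self.pinned_cpus = set()     # host CPUs already claimed by instances

        def pin_cpus(self, cpus):
            """Claim the given host CPUs, refusing any that are already pinned."""
            cpus = set(cpus)
            already_pinned = cpus & self.pinned_cpus
            if already_pinned:
                raise CPUPinningInvalid(
                    "Cannot pin/unpin cpus %s from the following pinned set %s"
                    % (sorted(already_pinned), sorted(self.pinned_cpus)))
            self.pinned_cpus |= cpus


    cell = NUMACellSketch(cpuset=range(3, 10))
    cell.pin_cpus([3, 4, 5, 6, 7, 8, 9])  # first request claims these CPUs
    cell.pin_cpus([4])                    # CPU 4 is already pinned: raises
                                          # CPUPinningInvalid with the same
                                          # message as the log above

The last line reproduces the message from the log: CPU 4 is requested again while it already sits in the pinned set [3, 4, 5, 6, 7, 8, 9].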
What is likely happening is the following (a sketch modelling this interleaving appears after the list):
* nova-scheduler is handling several RPC calls to select_destinations
at the same time, in multiple greenthreads
* greenthread 1 runs the NUMATopologyFilter and selects a CPU on a
particular compute node, updating host_state.instance_numa_topology
* greenthread 1 then blocks for some reason
* greenthread 2 runs the NUMATopologyFilter and selects the same CPU
on the same compute node, updating host_state.instance_numa_topology.
This would be a problem even if a different CPU were selected, since it
overwrites the instance_numa_topology chosen by greenthread 1.
* greenthread 2 then blocks for some reason
* greenthread 1 is scheduled again and calls consume_from_instance, which
consumes the NUMA resources based on what is in
host_state.instance_numa_topology
* greenthread 1 completes the scheduling operation
* greenthread 2 is scheduled again and calls consume_from_instance, which
consumes the NUMA resources based on what is in
host_state.instance_numa_topology; since those resources were already
consumed by greenthread 1, we get the exception above
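The following self-contained sketch models that interleaving. FakeHostState and its methods are stand-ins for the shared host state, the NUMATopologyFilter and consume_from_instance; they are not the real Nova classes, and no real greenthreads are needed to show the effect:

    # Minimal model of the race described above. FakeHostState, run_numa_filter
    # and consume_from_instance are illustrative stand-ins, not Nova's classes.

    class CPUPinningInvalid(Exception):
        pass


    class FakeHostState(object):
        def __init__(self, free_cpus):
            self.free_cpus = set(free_cpus)      # CPUs still available for pinning
            self.pinned_cpus = set()             # CPUs already consumed
            self.instance_numa_topology = None   # the filter's last decision
                                                 # (shared between requests!)

        def run_numa_filter(self):
            # Filter analogue: pick a free CPU and record the choice on the
            # *shared* host state rather than on per-request state.
            cpu = min(self.free_cpus)
            self.instance_numa_topology = {'pinned_cpu': cpu}
            return True

        def consume_from_instance(self):
            # Consume analogue: apply whatever decision is currently stored
            # on the shared host state.
            cpu = self.instance_numa_topology['pinned_cpu']
            if cpu in self.pinned_cpus:
                raise CPUPinningInvalid(
                    "Cannot pin/unpin cpus [%d] from the following pinned set %s"
                    % (cpu, sorted(self.pinned_cpus)))
            self.free_cpus.remove(cpu)
            self.pinned_cpus.add(cpu)


    host = FakeHostState(free_cpus=[4, 5, 6])

    host.run_numa_filter()         # greenthread 1 picks CPU 4
    host.run_numa_filter()         # greenthread 2 also picks CPU 4 (nothing
                                   # has been consumed yet, so the filter
                                   # still sees it as free) and overwrites
                                   # the shared decision
    host.consume_from_instance()   # greenthread 1 consumes CPU 4: succeeds
    host.consume_from_instance()   # greenthread 2 replays the same decision:
                                   # raises CPUPinningInvalid

The essential problem the sketch exposes is that the filter's decision is stored on state shared by all concurrent requests, so a consume step can act on a decision that another request has already consumed.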
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1438238/+subscriptions