[Bug 1988311] Re: Concurrent evacuation of vms with pinned cpus to the same host fail randomly
Reviewed: https://review.opendev.org/c/openstack/oslo.concurrency/+/855714
Committed: https://opendev.org/openstack/oslo.concurrency/commit/ee3f73a13379a79282325906787aae7da0f6ac27
Submitter: "Zuul (22348)"
Branch: master
commit ee3f73a13379a79282325906787aae7da0f6ac27
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date: Sat Sep 3 15:35:35 2022 +0200
Fix fair internal lock used from eventlet.spawn_n
The fasteners lib in version 0.15.0 removed the
threading.current_thread workaround for eventlet [1] because eventlet
seemed to have fixed the current_thread issues tracked in [2]. However,
the fix for [2] only solved half of the problem. The
threading.current_thread call works if it is called from a thread created
by eventlet.spawn. However, if the thread is created with
eventlet.spawn_n, then threading.current_thread is still broken and
returns the identity of the native Python thread.
fasteners' ReaderWriterLock depends heavily on
threading.current_thread to decide which thread holds a lock and to
allow re-entry by that thread. This leads to a situation where
multiple threads created from spawn_n can take the same
ReaderWriterLock at the same time.
The fair internal lock in oslo.concurrency uses ReaderWriterLock, and
as a result such a lock is broken for threads created with spawn_n.
Note that this issue was raised with eventlet in [3] when the nova team
detected it via direct usage of ReaderWriterLock in the nova test
code. As [3] did not lead to a solution in eventlet, nova implemented a
local fix for its test code in [4].
However, we have now detected that oslo.concurrency is affected by this
issue as well.
This patch restores the workaround that was removed by [1].
Note that a fasteners issue [5] has also been opened to restore the
workaround [1].
[1] https://github.com/harlowja/fasteners/commit/467ed75ee1e9465ebff8b5edf452770befb93913
[2] https://github.com/eventlet/eventlet/issues/172
[3] https://github.com/eventlet/eventlet/issues/731
[4] https://review.opendev.org/c/openstack/nova/+/813114
[5] https://github.com/harlowja/fasteners/issues/96
Closes-Bug: #1988311
Change-Id: Ia873bcc6b07121c9bd0b94c593567d537b4c1112
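The snippet below is not part of the patch above; it is a minimal illustration of the failure mode the commit message describes, using only eventlet and the standard library. Per this bug, greenthreads started with eventlet.spawn_n all report the native Python thread from threading.current_thread, while eventlet's own getcurrent() still distinguishes them:
# Minimal illustration (not the patch itself): under eventlet monkey
# patching, greenthreads started with spawn_n report the native Python
# thread from threading.current_thread(), so a lock keyed on it cannot
# tell them apart, while eventlet's own getcurrent() still can.
import eventlet
eventlet.monkey_patch()
import threading
from eventlet import greenthread
seen = []
def record(tag):
    seen.append((tag, threading.current_thread(), greenthread.getcurrent()))
for i in range(2):
    eventlet.spawn_n(record, i)
eventlet.sleep(0)  # yield so both greenthreads get to run
# Per this bug, the threading.current_thread() values collide while the
# greenthread.getcurrent() values stay distinct.
for tag, native, green in seen:
    print(tag, native is seen[0][1], green is seen[0][2])
In effect, the restored workaround identifies the lock holder by the current greenthread (eventlet.getcurrent) instead of the collided native thread identity whenever eventlet monkey patching is active.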
** Changed in: oslo.concurrency
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988311
Title:
Concurrent evacuation of vms with pinned cpus to the same host fail
randomly
Status in OpenStack Compute (nova):
In Progress
Status in oslo.concurrency:
Fix Released
Bug description:
Reproduction:
Boot two vms (each with one pinned cpu) on devstack0.
Then evacuate them to devstack0a.
devstack0a has two dedicated cpus, so both vms should fit.
However sometimes (for example 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.
devstack0 - all-in-one host
devstack0a - compute-only host
# have two dedicated cpus for pinning on the evacuation target host
devstack0a:/etc/nova/nova-cpu.conf:
[compute]
cpu_dedicated_set = 0,1
# the dedicated cpus are properly tracked in placement
$ openstack resource provider list
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a | 2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None |
| 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0 | 2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
$ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| MEMORY_MB | 1.5 | 1 | 3923 | 512 | 1 | 3923 | 0 |
| DISK_GB | 1.0 | 1 | 28 | 0 | 1 | 28 | 0 |
| PCPU | 1.0 | 1 | 2 | 0 | 1 | 2 | 0 |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
# use vms with one pinned cpu
openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
# boot two vms (each with one pinned cpu) on devstack0
n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done
# kill n-cpu on devstack0
devstack0 $ sudo systemctl stop devstack@n-cpu
# and force it down, so we can start evacuating
openstack compute service set devstack0 nova-compute --down
# evacuate both vms to devstack0a concurrently
for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done
# follow up on how the evacuation is going, check if the bug occurred, see details a bit below
for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done
# clean up
devstack0 $ sudo systemctl start devstack@n-cpu
openstack compute service set devstack0 nova-compute --up
for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done
This bug is not deterministic. For example, out of 10 tries (like
above) I have seen 4 successes, where both vms successfully evacuated
to (went to ACTIVE on) devstack0a.
But in the other 6 cases only one vm evacuated successfully. The other
vm went to ERROR state with the error message: "CPU set to pin [0]
must be a subset of free CPU set [1]". For example:
$ openstack server show vm2
...
| fault | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
...
In n-cpu logs we see the following:
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] yield
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]
aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server
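The traceback ends in pin_cpus() raising CPUPinningInvalid. A toy model of that check, in plain Python rather than nova code, shows how two racing claims that both start from the same stale view of the host end up requesting the same CPU and producing exactly this message:
# Toy model of the failing check (not nova code): the host has dedicated
# CPUs {0, 1}; each claim pins the lowest CPU it believes is free.
free = {0, 1}
def pick_and_pin(view_of_free):
    requested = {min(view_of_free)}
    if not requested <= free:
        raise ValueError('CPU set to pin %s must be a subset of free CPU '
                         'set %s' % (sorted(requested), sorted(free)))
    free.difference_update(requested)
    return requested
# Serialized claims are fine: the second claim sees the updated state.
# With the broken fair lock both claims race and both start from the
# stale view {0, 1}, so both request CPU 0; the second call fails with
# "CPU set to pin [0] must be a subset of free CPU set [1]".
stale_view = {0, 1}
pick_and_pin(stale_view)  # first evacuation pins CPU 0
pick_and_pin(stale_view)  # second evacuation also requests CPU 0 -> raises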
devstack 90e5479f
nova ddcc286ee1
hypervisor: libvirt/kvm
libvirt 8.0.0-1ubuntu7.1
linux 5.15.0-46-generic
networking: neutron ml2/ovs
This environment had the default 2 scheduler workers, but AFAIU this
should not matter: both vms should fit on the target host, which means
this is likely a bug in the compute service and not in the scheduler.
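For reference, below is a minimal sketch of the affected combination, with hypothetical function and lock names rather than nova's actual code path: an oslo.concurrency fair lock entered from greenthreads created with eventlet.spawn_n, which is how two concurrent evacuation handlers could end up inside the resource-claim critical section at the same time before the oslo.concurrency fix.
# Minimal sketch of the affected pattern (hypothetical names, not the
# nova code path): an oslo.concurrency fair lock used from greenthreads
# created with eventlet.spawn_n.
import eventlet
eventlet.monkey_patch()
from oslo_concurrency import lockutils
inside = []
@lockutils.synchronized('compute_resources', fair=True)
def claim(instance):
    inside.append(instance)
    if len(inside) > 1:
        # With the broken greenthread identity both callers can be treated
        # as the same lock holder, so both end up in here concurrently.
        print('both claims are inside the fair lock:', inside)
    eventlet.sleep(0)  # yield while "holding" the lock
    inside.remove(instance)
for vm in ('vm1', 'vm2'):
    eventlet.spawn_n(claim, vm)
eventlet.sleep(0.1)  # let both greenthreads finish
With the oslo.concurrency fix released above, the two spawn_n callers are treated as distinct lock holders again and the second claim waits for the first.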
Let me know if I can collect more information.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988311/+subscriptions