yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #89634
[Bug 1988311] Re: Concurrent evacuation of vms with pinned cpus to the same host fail randomly
I think fasteners >= 0.15 + eventlet.spawn_n breaks oslo.concurrency
fair locking. Here is a minimal reproduction that showing the issue:
https://gist.github.com/gibizer/9051369e67fd46a20d52963dac534852
It seems it all boils down to the eventlet issue:
https://github.com/eventlet/eventlet/issues/731
Nova did a test only fix in the past
https://review.opendev.org/c/openstack/nova/+/813114 but something more
global is needed as oslo.concurrency uses fasteners' ReaderWriterLock
under the hood to implement fair locking:
https://github.com/openstack/oslo.concurrency/blob/5.0.0/oslo_concurrency/lockutils.py#L287
We can try to adapt the fix from
https://review.opendev.org/c/openstack/nova/+/813114 and apply it for
every fair lock created by oslo.
** Bug watch added: github.com/eventlet/eventlet/issues #731
https://github.com/eventlet/eventlet/issues/731
** Also affects: oslo.concurrency
Importance: Undecided
Status: New
** Changed in: nova
Status: New => Confirmed
** Changed in: nova
Status: Confirmed => Triaged
** Changed in: nova
Importance: Undecided => Critical
** Tags added: oslo zed-rc-potential
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988311
Title:
Concurrent evacuation of vms with pinned cpus to the same host fail
randomly
Status in OpenStack Compute (nova):
Triaged
Status in oslo.concurrency:
New
Bug description:
Reproduction:
Boot two vms (each with one pinned cpu) on devstack0.
Then evacuate them to devtack0a.
devstack0a has two dedicated cpus, so both vms should fit.
However sometimes (for example 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.
devstack0 - all-in-one host
devstack0a - compute-only host
# have two dedicated cpus for pinning on the evacuation target host
devstack0a:/etc/nova/nova-cpu.conf:
[compute]
cpu_dedicated_set = 0,1
# the dedicated cpus are properly tracked in placement
$ openstack resource provider list
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a | 2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None |
| 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0 | 2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
$ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| MEMORY_MB | 1.5 | 1 | 3923 | 512 | 1 | 3923 | 0 |
| DISK_GB | 1.0 | 1 | 28 | 0 | 1 | 28 | 0 |
| PCPU | 1.0 | 1 | 2 | 0 | 1 | 2 | 0 |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
# use vms with one pinned cpu
openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
# boot two vms (each with one pinned cpu) on devstack0
n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done
# kill n-cpu on devstack0
devstack0 $ sudo systemctl stop devstack@n-cpu
# and force it down, so we can start evacuating
openstack compute service set devstack0 nova-compute --down
# evacuate both vms to devstack0a concurrently
for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done
# follow up on how the evacuation is going, check if the bug occured, see details a bit below
for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done
# clean up
devstack0 $ sudo systemctl start devstack@n-cpu
openstack compute service set devstack0 nova-compute --up
for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done
This bug is not deterministic. For example out of 10 tries (like
above) I have seen 4 successes - when both vms successfully evacuated
to (went to ACTIVE on) devstack0a.
But in the other 6 cases only one vm evacuated successfully. The other
vm went to ERROR state, with the error message: "CPU set to pin [0]
must be a subset of free CPU set [1]". For example:
$ openstack server show vm2
...
| fault | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
...
In n-cpu logs we see the following:
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] yield
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]·
aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server
devstack 90e5479f
nova ddcc286ee1
hypervisor: libvirt/kvm
libvirt 8.0.0-1ubuntu7.1
linux 5.15.0-46-generic
networking: neutron ml2/ovs
This environment had the default 2 scheduler workers, but AFAIU this
should not matter, since both vms should fit on the target host and
this means this is likely a bug in the compute and not in the
scheduler.
Let me know if I can collect more information.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988311/+subscriptions
References