[Bug 1988311] Re: Concurrent evacuation of vms with pinned cpus to the same host fail randomly
Setting this to High, as we need to bump our requirements on master to
prevent the use of older releases of oslo.concurrency.
We also need to backport the patch to the stable oslo.concurrency release
for Yoga.
** Also affects: nova/yoga
Importance: Undecided
Status: New
** Changed in: nova/yoga
Status: New => Confirmed
** Changed in: nova/yoga
Importance: Undecided => High
** Changed in: nova
Importance: Critical => High
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988311
Title:
Concurrent evacuation of vms with pinned cpus to the same host fail
randomly
Status in OpenStack Compute (nova):
In Progress
Status in OpenStack Compute (nova) yoga series:
Confirmed
Status in oslo.concurrency:
Fix Released
Bug description:
Reproduction:
Boot two vms (each with one pinned cpu) on devstack0.
Then evacuate them to devstack0a.
devstack0a has two dedicated cpus, so both vms should fit.
However, sometimes (for example 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.
devstack0 - all-in-one host
devstack0a - compute-only host
# have two dedicated cpus for pinning on the evacuation target host
devstack0a:/etc/nova/nova-cpu.conf:
[compute]
cpu_dedicated_set = 0,1
# the dedicated cpus are properly tracked in placement
$ openstack resource provider list
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a | 2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None |
| 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0 | 2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
$ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| MEMORY_MB | 1.5 | 1 | 3923 | 512 | 1 | 3923 | 0 |
| DISK_GB | 1.0 | 1 | 28 | 0 | 1 | 28 | 0 |
| PCPU | 1.0 | 1 | 2 | 0 | 1 | 2 | 0 |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
# use vms with one pinned cpu
openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
# boot two vms (each with one pinned cpu) on devstack0
n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done
# kill n-cpu on devstack0
devstack0 $ sudo systemctl stop devstack@n-cpu
# and force it down, so we can start evacuating
openstack compute service set devstack0 nova-compute --down
# evacuate both vms to devstack0a concurrently
for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done
# follow up on how the evacuation is going and check whether the bug occurred, see details a bit below (a small polling helper is also sketched after the cleanup commands)
for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done
# clean up
devstack0 $ sudo systemctl start devstack@n-cpu
openstack compute service set devstack0 nova-compute --up
for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done
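Because the failure is probabilistic, the "follow up" step above can be
scripted. The following is only a rough sketch, assuming the vms are named
vm1 and vm2 as above, that admin credentials are already sourced, and that
the same openstack CLI used in the commands above is on PATH; adjust the
names for other setups.
# Rough status poller for the follow-up step; illustration only.
import subprocess
import time

def server_field(name, field):
    # Return a single field of `openstack server show` as plain text.
    return subprocess.run(
        ["openstack", "server", "show", name, "-f", "value", "-c", field],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

vms = ["vm1", "vm2"]
time.sleep(10)  # give the evacuations a moment to move the vms into REBUILD
while any(server_field(vm, "status") == "REBUILD" for vm in vms):
    time.sleep(5)

for vm in vms:
    status = server_field(vm, "status")
    host = server_field(vm, "OS-EXT-SRV-ATTR:host")
    print(f"{vm}: {status} on {host}")
    if status == "ERROR":
        print("  fault:", server_field(vm, "fault"))
When the bug hits, one vm reports ACTIVE on devstack0a and the other ERROR
with the fault shown below.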
This bug is not deterministic. For example, out of 10 tries (as above) I
have seen 4 successes, where both vms evacuated successfully to (went to
ACTIVE on) devstack0a.
In the other 6 cases only one vm evacuated successfully. The other vm
went to ERROR state with the error message: "CPU set to pin [0] must be
a subset of free CPU set [1]". For example:
$ openstack server show vm2
...
| fault | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
...
In n-cpu logs we see the following:
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] yield
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]
aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server
devstack 90e5479f
nova ddcc286ee1
hypervisor: libvirt/kvm
libvirt 8.0.0-1ubuntu7.1
linux 5.15.0-46-generic
networking: neutron ml2/ovs
This environment had the default 2 scheduler workers, but as far as I
understand this should not matter: both vms should fit on the target
host, so this is likely a bug in the compute service rather than in the
scheduler.
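Given that the traceback goes through the lock decorator in
oslo_concurrency/lockutils.py around rebuild_claim(), and that the
oslo.concurrency task above is marked Fix Released, the suspicion is that
the lock guarding the resource tracker claim did not actually exclude
concurrent claims, so both rebuild claims read the same free CPU set. The
sketch below is an illustration only, not nova or oslo.concurrency code;
FakeNumaCell, BrokenLock and claim() are made-up names.
# Illustration only: how two concurrent claims can both read {0, 1} as free,
# both pick CPU 0, and collide with "CPU set to pin [0] must be a subset of
# free CPU set [1]" when the lock around the claim does not really exclude.
import threading
import time

class FakeNumaCell:
    def __init__(self, cpus):
        self.free_cpus = set(cpus)            # {0, 1}, as on devstack0a

    def pin_cpus(self, cpus):
        if not cpus <= self.free_cpus:        # mirrors the check that raises
            raise RuntimeError(
                "CPU set to pin %s must be a subset of free CPU set %s"
                % (sorted(cpus), sorted(self.free_cpus)))
        self.free_cpus -= cpus

class BrokenLock:
    # Stands in for a "fair" lock that fails to exclude concurrent holders.
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def claim(cell, lock):
    with lock:
        free = set(cell.free_cpus)            # both claims may see {0, 1}
        chosen = {min(free)}                  # so both choose CPU 0
        time.sleep(0.01)                      # widen the race window
        cell.pin_cpus(chosen)

cell = FakeNumaCell({0, 1})
threads = [threading.Thread(target=claim, args=(cell, BrokenLock()))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With BrokenLock one of the threads usually raises the error above; with a
# real threading.Lock() shared by both threads, the second claim sees only
# CPU 1 free and pins it, so both vms fit, which is the expected outcome.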
Let me know if I can collect more information.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988311/+subscriptions