
yahoo-eng-team team mailing list archive

[Bug 1988311] Re: Concurrent evacuation of vms with pinned cpus to the same host fail randomly

 

Setting to High as we need to bump our requirements on master to
exclude older releases of oslo.concurrency.

Also, we need to backport the patch into the stable releases of
oslo.concurrency for Yoga.
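
For reference, such a requirements bump is normally a one-line minimum-version change in nova's requirements.txt; the version below is only a placeholder for whichever oslo.concurrency release first contains the fix, not a value taken from this bug:

# requirements.txt (placeholder minimum version, for illustration only)
oslo.concurrency>=X.Y.Z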

** Also affects: nova/yoga
   Importance: Undecided
       Status: New

** Changed in: nova/yoga
       Status: New => Confirmed

** Changed in: nova/yoga
   Importance: Undecided => High

** Changed in: nova
   Importance: Critical => High

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988311

Title:
  Concurrent evacuation of vms with pinned cpus to the same host fail
  randomly

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) yoga series:
  Confirmed
Status in oslo.concurrency:
  Fix Released

Bug description:
  Reproduction:

  Boot two vms (each with one pinned cpu) on devstack0.
  Then evacuate them to devstack0a.
  devstack0a has two dedicated cpus, so both vms should fit.
  However, sometimes (for example, 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.

  devstack0 - all-in-one host
  devstack0a - compute-only host

  # have two dedicated cpus for pinning on the evacuation target host
  devstack0a:/etc/nova/nova-cpu.conf:
  [compute]
  cpu_dedicated_set = 0,1

  # the dedicated cpus are properly tracked in placement
  $ openstack resource provider list
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  | uuid                                 | name       | generation | root_provider_uuid                   | parent_provider_uuid |
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  | a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a |          2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None                 |
  | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0  |          2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None                 |
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  $ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
  +----------------+------------------+----------+----------+----------+-----------+-------+------+
  | resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
  +----------------+------------------+----------+----------+----------+-----------+-------+------+
  | MEMORY_MB      |              1.5 |        1 |     3923 |      512 |         1 |  3923 |    0 |
  | DISK_GB        |              1.0 |        1 |       28 |        0 |         1 |    28 |    0 |
  | PCPU           |              1.0 |        1 |        2 |        0 |         1 |     2 |    0 |
  +----------------+------------------+----------+----------+----------+-----------+-------+------+

  # use vms with one pinned cpu
  openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated

  # boot two vms (each with one pinned cpu) on devstack0
  n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done                            

  # kill n-cpu on devstack0
  devstack0 $ sudo systemctl stop devstack@n-cpu
  # and force it down, so we can start evacuating
  openstack compute service set devstack0 nova-compute --down

  # evacuate both vms to devstack0a concurrently
  for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done

  # follow up on how the evacuation is going; check whether the bug occurred (see details a bit below)
  for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done

  # clean up
  devstack0 $ sudo systemctl start devstack@n-cpu
  openstack compute service set devstack0 nova-compute --up
  for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done
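
  # One way to estimate the failure rate is to repeat the steps above in a loop;
  # a rough sketch, assuming everything runs from one shell with ssh access to
  # devstack0, n=2 as above, and an arbitrary settle time after the evacuations
  errors=0
  for run in $( seq 10 ) ; do
    ssh devstack0 sudo systemctl stop devstack@n-cpu
    openstack compute service set devstack0 nova-compute --down
    for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done
    wait ; sleep 60  # arbitrary wait for the evacuations to settle
    errors=$(( errors + $( openstack server list --status ERROR -f value -c ID | wc -l ) ))
    ssh devstack0 sudo systemctl start devstack@n-cpu
    openstack compute service set devstack0 nova-compute --up
    for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done
    for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done
  done
  echo "vms that went to ERROR across 10 runs: $errors"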

  This bug is not deterministic. For example, out of 10 tries (like the
  above) I have seen 4 successes, i.e. both vms successfully evacuated
  to (went to ACTIVE on) devstack0a.

  But in the other 6 cases only one vm evacuated successfully. The other
  vm went to ERROR state, with the error message: "CPU set to pin [0]
  must be a subset of free CPU set [1]". For example:

  $ openstack server show vm2
  ...
  | fault                               | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
  ...

  In n-cpu logs we see the following:

  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     yield
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._do_rebuild_instance_with_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     claim_context = rebuild_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     return f(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     return self._move_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._update_usage_from_migration(context, instance, migration,
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._update_usage(usage, nodename)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     cn.numa_topology = hardware.numa_usage_from_instance_numa(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     new_cell.pin_cpus(pinned_cpus)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     raise exception.CPUPinningInvalid(requested=list(cpus),
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]·
  aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._do_rebuild_instance_with_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     claim_context = rebuild_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return self._move_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._update_usage_from_migration(context, instance, migration,
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._update_usage(usage, nodename)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     cn.numa_topology = hardware.numa_usage_from_instance_numa(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     new_cell.pin_cpus(pinned_cpus)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise exception.CPUPinningInvalid(requested=list(cpus),
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server
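
  The claim above runs under an oslo.concurrency lock (the lockutils.py
  frame in the traceback), which matches the fix being released in
  oslo.concurrency rather than in nova. Below is a minimal sketch of that
  serialization pattern, not nova's actual code; the lock name, the
  function and the fair=True flag are illustrative assumptions:

  from oslo_concurrency import lockutils

  @lockutils.synchronized('compute_resources', fair=True)
  def claim_pinned_cpus(free_cpus, requested):
      # Concurrent evacuations are expected to enter this critical section
      # one at a time; the failure above is consistent with both claims
      # observing the same "free" CPU set.
      if not requested <= free_cpus:
          raise ValueError('CPU set to pin %s must be a subset of free CPU set %s'
                           % (sorted(requested), sorted(free_cpus)))
      free_cpus -= requested
      return free_cpus

  For example, calling this twice against the same free set, first with {0}
  and then with {0} again, reproduces the "must be a subset of free CPU set"
  style of error; with proper serialization the second caller should instead
  see the already-reduced free set.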

  devstack 90e5479f
  nova ddcc286ee1
  hypervisor: libvirt/kvm
  libvirt 8.0.0-1ubuntu7.1
  linux 5.15.0-46-generic
  networking: neutron ml2/ovs

  This environment had the default 2 scheduler workers, but AFAIU this
  should not matter: both vms should fit on the target host, which
  suggests the bug is in the compute service rather than in the
  scheduler.
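
  One way to gather more data would be to look at the allocations placement
  records against devstack0a around the time of the failure: if both vms get
  PCPU allocations on that provider before one of the claims fails,
  scheduling/placement succeeded and the race is in the compute claim. For
  example, with the osc-placement plugin:

  openstack resource provider show a0574d87-42ee-4e13-b05a-639dc62c1196 --allocations
  openstack resource provider usage show a0574d87-42ee-4e13-b05a-639dc62c1196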

  Let me know if I can collect more information.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988311/+subscriptions


