yahoo-eng-team team mailing list archive

[Bug 1988311] Re: Concurrent evacuation of vms with pinned cpus to the same host fail randomly

 

I think fasteners >= 0.15 + eventlet.spawn_n breaks oslo.concurrency
fair locking. Here is a minimal reproduction showing the issue:
https://gist.github.com/gibizer/9051369e67fd46a20d52963dac534852

It seems it all boils down to the eventlet issue:
https://github.com/eventlet/eventlet/issues/731

Nova applied a test-only fix in the past
(https://review.opendev.org/c/openstack/nova/+/813114), but something
more global is needed, since oslo.concurrency uses fasteners'
ReaderWriterLock under the hood to implement fair locking:
https://github.com/openstack/oslo.concurrency/blob/5.0.0/oslo_concurrency/lockutils.py#L287

We can try to adapt the fix from
https://review.opendev.org/c/openstack/nova/+/813114 and apply it to
every fair lock created by oslo.
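The failure mode, as I understand it: fasteners' ReaderWriterLock allows re-entrant acquisition by the current owner, and it identifies that owner via threading.current_thread(). Greenlets started with eventlet.spawn_n can all report the same thread identity, so two greenthreads are treated as one owner and both "hold" the exclusive lock at once. Here is a stdlib-only toy (hypothetical class, not the real fasteners code) showing the principle:

```python
import threading


class OwnerKeyedWriteLock:
    """Toy re-entrant write lock that identifies its owner by a key.

    fasteners' ReaderWriterLock keys ownership on
    threading.current_thread(); this toy makes the key pluggable so the
    failure mode is easy to demonstrate.
    """

    def __init__(self, owner_key=threading.current_thread):
        self._owner_key = owner_key
        self._owner = None
        self._cond = threading.Condition()

    def acquire_write(self):
        me = self._owner_key()
        with self._cond:
            # Wait only while *someone else* owns the lock; the same
            # owner key re-enters freely.
            while self._owner is not None and self._owner != me:
                self._cond.wait()
            self._owner = me

    def release_write(self):
        with self._cond:
            self._owner = None
            self._cond.notify_all()


# If two concurrent tasks report the *same* owner key -- which is what
# happens when eventlet.spawn_n greenlets share one
# threading.current_thread() identity -- both acquire the "exclusive"
# lock without ever blocking, i.e. no mutual exclusion at all:
shared_key = object()
lock = OwnerKeyedWriteLock(owner_key=lambda: shared_key)
lock.acquire_write()
lock.acquire_write()  # succeeds immediately instead of blocking
```

A global fix along these lines would presumably mean keying the fair lock's owner on the running greenlet (e.g. eventlet.getcurrent) rather than on threading.current_thread(), so each greenthread gets a distinct identity again; how exactly to hook that into oslo.concurrency is the open question.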

** Bug watch added: github.com/eventlet/eventlet/issues #731
   https://github.com/eventlet/eventlet/issues/731

** Also affects: oslo.concurrency
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: New => Confirmed

** Changed in: nova
       Status: Confirmed => Triaged

** Changed in: nova
   Importance: Undecided => Critical

** Tags added: oslo zed-rc-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988311

Title:
  Concurrent evacuation of vms with pinned cpus to the same host fail
  randomly

Status in OpenStack Compute (nova):
  Triaged
Status in oslo.concurrency:
  New

Bug description:
  Reproduction:

  Boot two vms (each with one pinned cpu) on devstack0.
  Then evacuate them to devstack0a.
  devstack0a has two dedicated cpus, so both vms should fit.
  However sometimes (for example 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.

  devstack0 - all-in-one host
  devstack0a - compute-only host

  # have two dedicated cpus for pinning on the evacuation target host
  devstack0a:/etc/nova/nova-cpu.conf:
  [compute]
  cpu_dedicated_set = 0,1

  # the dedicated cpus are properly tracked in placement
  $ openstack resource provider list
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  | uuid                                 | name       | generation | root_provider_uuid                   | parent_provider_uuid |
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  | a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a |          2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None                 |
  | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0  |          2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None                 |
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  $ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
  +----------------+------------------+----------+----------+----------+-----------+-------+------+
  | resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
  +----------------+------------------+----------+----------+----------+-----------+-------+------+
  | MEMORY_MB      |              1.5 |        1 |     3923 |      512 |         1 |  3923 |    0 |
  | DISK_GB        |              1.0 |        1 |       28 |        0 |         1 |    28 |    0 |
  | PCPU           |              1.0 |        1 |        2 |        0 |         1 |     2 |    0 |
  +----------------+------------------+----------+----------+----------+-----------+-------+------+

  # use vms with one pinned cpu
  openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated

  # boot two vms (each with one pinned cpu) on devstack0
  n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done                            

  # kill n-cpu on devstack0
  devstack0 $ sudo systemctl stop devstack@n-cpu
  # and force it down, so we can start evacuating
  openstack compute service set devstack0 nova-compute --down

  # evacuate both vms to devstack0a concurrently
  for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done

  # follow up on how the evacuation is going; check whether the bug occurred, see details a bit below
  for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done

  # clean up
  devstack0 $ sudo systemctl start devstack@n-cpu
  openstack compute service set devstack0 nova-compute --up
  for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done

  This bug is not deterministic. For example out of 10 tries (like
  above) I have seen 4 successes - when both vms successfully evacuated
  to (went to ACTIVE on) devstack0a.

  But in the other 6 cases only one vm evacuated successfully. The other
  vm went to ERROR state, with the error message: "CPU set to pin [0]
  must be a subset of free CPU set [1]". For example:

  $ openstack server show vm2
  ...
  | fault                               | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
  ...

  In n-cpu logs we see the following:

  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     yield
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._do_rebuild_instance_with_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     claim_context = rebuild_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     return f(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     return self._move_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._update_usage_from_migration(context, instance, migration,
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._update_usage(usage, nodename)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     cn.numa_topology = hardware.numa_usage_from_instance_numa(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     new_cell.pin_cpus(pinned_cpus)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     raise exception.CPUPinningInvalid(requested=list(cpus),
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]
  aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._do_rebuild_instance_with_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     claim_context = rebuild_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return self._move_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._update_usage_from_migration(context, instance, migration,
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._update_usage(usage, nodename)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     cn.numa_topology = hardware.numa_usage_from_instance_numa(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     new_cell.pin_cpus(pinned_cpus)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise exception.CPUPinningInvalid(requested=list(cpus),
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server

  devstack 90e5479f
  nova ddcc286ee1
  hypervisor: libvirt/kvm
  libvirt 8.0.0-1ubuntu7.1
  linux 5.15.0-46-generic
  networking: neutron ml2/ovs

  This environment had the default 2 scheduler workers, but AFAIU that
  should not matter: both vms fit on the target host, so this is likely
  a bug in the compute service rather than in the scheduler.

  Let me know if I can collect more information.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988311/+subscriptions


