yahoo-eng-team team mailing list archive

[Bug 1988311] Re: Concurrent evacuation of vms with pinned cpus to the same host fail randomly

Reviewed:  https://review.opendev.org/c/openstack/oslo.concurrency/+/855714
Committed: https://opendev.org/openstack/oslo.concurrency/commit/ee3f73a13379a79282325906787aae7da0f6ac27
Submitter: "Zuul (22348)"
Branch:    master

commit ee3f73a13379a79282325906787aae7da0f6ac27
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date:   Sat Sep 3 15:35:35 2022 +0200

    Fix fair internal lock used from eventlet.spawn_n
    
    The fasteners lib in version 0.15.0 removed the
    threading.current_thread workaround for eventlet [1] because eventlet
    seemed to have fixed the current_thread issues tracked in [2].
    However, the fix for [2] only solved half of the problem. The
    threading.current_thread call works if it is called from a thread
    created by eventlet.spawn. However, if the thread is created with
    eventlet.spawn_n, then threading.current_thread is still broken and
    returns the ID of the python native thread.
    
    fasteners' ReaderWriterLock depends heavily on
    threading.current_thread to decide which thread holds a lock and to
    allow re-entry by that thread. This leads to a situation where
    multiple threads created from spawn_n can take the same
    ReaderWriterLock at the same time.

    The fair internal lock in oslo.concurrency uses ReaderWriterLock, so
    such a lock is broken for threads created with spawn_n.
    
    Note that this issue was raised with eventlet in [3] when the nova
    team detected it via a direct usage of ReaderWriterLock in the nova
    test code. As [3] did not lead to a solution in eventlet, nova
    implemented a local fix for its test code in [4].

    However, we have now detected that oslo.concurrency is affected by
    this issue as well.

    This patch restores the workaround that was removed by [1].

    Note that a fasteners issue [5] was also opened to restore the
    workaround [1].
    
    [1] https://github.com/harlowja/fasteners/commit/467ed75ee1e9465ebff8b5edf452770befb93913
    [2] https://github.com/eventlet/eventlet/issues/172
    [3] https://github.com/eventlet/eventlet/issues/731
    [4] https://review.opendev.org/c/openstack/nova/+/813114
    [5] https://github.com/harlowja/fasteners/issues/96
    
    Closes-Bug: #1988311
    Change-Id: Ia873bcc6b07121c9bd0b94c593567d537b4c1112

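To make the failure mode concrete, here is a small illustrative sketch (not part of the change above) comparing how the two spawn primitives look to threading.current_thread in a monkey patched environment; the labels and print format are made up for the demo:

import eventlet
eventlet.monkey_patch()

import threading

results = []


def record(label):
    # threading.current_thread() is what fasteners uses to identify the
    # lock holder; eventlet.getcurrent() is the greenlet actually running.
    results.append((label, threading.current_thread(), eventlet.getcurrent()))


eventlet.spawn(record, 'spawn-1').wait()
eventlet.spawn(record, 'spawn-2').wait()

eventlet.spawn_n(record, 'spawn_n-1')
eventlet.spawn_n(record, 'spawn_n-2')
eventlet.sleep(0)  # give the spawn_n greenthreads a chance to run

for label, thread, green in results:
    print('%-10s thread=%s green=%#x' % (label, thread, id(green)))

# Per the bug described above, the two spawn_n entries report the same
# threading.current_thread() object, so a ReaderWriterLock keyed on it
# treats them as a single thread and lets both "re-enter" the lock, while
# eventlet.getcurrent() still tells the two greenthreads apart.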

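The shape of the restored workaround is, roughly, to identify the lock holder via eventlet.getcurrent() instead of threading.current_thread() whenever eventlet has monkey patched the thread module. A minimal sketch of that idea follows; the _current_thread attribute is an assumption about fasteners' internals at the time, so refer to the linked oslo.concurrency change for the actual code:

import threading

import fasteners

try:
    import eventlet
    from eventlet import patcher as eventlet_patcher
except ImportError:
    eventlet = None
    eventlet_patcher = None


class ReaderWriterLock(fasteners.ReaderWriterLock):
    """Sketch: a ReaderWriterLock that can tell spawn_n callers apart."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumption: fasteners keeps its "current thread" callable in
        # self._current_thread. Swap it for eventlet.getcurrent when the
        # thread module is monkey patched, since only the greenlet
        # uniquely identifies a greenthread started with spawn_n.
        if eventlet is not None and eventlet_patcher is not None:
            if eventlet_patcher.is_monkey_patched('thread'):
                self._current_thread = eventlet.getcurrent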
** Changed in: oslo.concurrency
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988311

Title:
  Concurrent evacuation of vms with pinned cpus to the same host fail
  randomly

Status in OpenStack Compute (nova):
  In Progress
Status in oslo.concurrency:
  Fix Released

Bug description:
  Reproduction:

  Boot two vms (each with one pinned cpu) on devstack0.
  Then evacuate them to devstack0a.
  devstack0a has two dedicated cpus, so both vms should fit.
  However, sometimes (for example 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.

  devstack0 - all-in-one host
  devstack0a - compute-only host

  # have two dedicated cpus for pinning on the evacuation target host
  devstack0a:/etc/nova/nova-cpu.conf:
  [compute]
  cpu_dedicated_set = 0,1

  # the dedicated cpus are properly tracked in placement
  $ openstack resource provider list
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  | uuid                                 | name       | generation | root_provider_uuid                   | parent_provider_uuid |
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  | a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a |          2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None                 |
  | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0  |          2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None                 |
  +--------------------------------------+------------+------------+--------------------------------------+----------------------+
  $ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
  +----------------+------------------+----------+----------+----------+-----------+-------+------+
  | resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
  +----------------+------------------+----------+----------+----------+-----------+-------+------+
  | MEMORY_MB      |              1.5 |        1 |     3923 |      512 |         1 |  3923 |    0 |
  | DISK_GB        |              1.0 |        1 |       28 |        0 |         1 |    28 |    0 |
  | PCPU           |              1.0 |        1 |        2 |        0 |         1 |     2 |    0 |
  +----------------+------------------+----------+----------+----------+-----------+-------+------+

  # use vms with one pinned cpu
  openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated

  # boot two vms (each with one pinned cpu) on devstack0
  n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done                            

  # kill n-cpu on devstack0
  devstack0 $ sudo systemctl stop devstack@n-cpu
  # and force it down, so we can start evacuating
  openstack compute service set devstack0 nova-compute --down

  # evacuate both vms to devstack0a concurrently
  for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done

  # follow up on how the evacuation is going and check whether the bug occurred; see details a bit below
  for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done

  # clean up
  devstack0 $ sudo systemctl start devstack@n-cpu
  openstack compute service set devstack0 nova-compute --up
  for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done

  This bug is not deterministic. For example, out of 10 tries (as
  above) I have seen 4 successes, where both vms successfully evacuated
  to (went to ACTIVE on) devstack0a.

  But in the other 6 cases only one vm evacuated successfully. The other
  vm went to ERROR state, with the error message: "CPU set to pin [0]
  must be a subset of free CPU set [1]". For example:

  $ openstack server show vm2
  ...
  | fault                               | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
  ...

  In n-cpu logs we see the following:

  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     yield
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._do_rebuild_instance_with_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     claim_context = rebuild_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     return f(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     return self._move_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._update_usage_from_migration(context, instance, migration,
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     self._update_usage(usage, nodename)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     cn.numa_topology = hardware.numa_usage_from_instance_numa(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     new_cell.pin_cpus(pinned_cpus)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]   File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]     raise exception.CPUPinningInvalid(requested=list(cpus),
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]·
  aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self.force_reraise()
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise self.value
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._do_rebuild_instance_with_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     claim_context = rebuild_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     return self._move_claim(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._update_usage_from_migration(context, instance, migration,
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     self._update_usage(usage, nodename)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     cn.numa_topology = hardware.numa_usage_from_instance_numa(
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     new_cell.pin_cpus(pinned_cpus)
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server     raise exception.CPUPinningInvalid(requested=list(cpus),
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
  aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server

  devstack 90e5479f
  nova ddcc286ee1
  hypervisor: libvirt/kvm
  libvirt 8.0.0-1ubuntu7.1
  linux 5.15.0-46-generic
  networking: neutron ml2/ovs

  This environment had the default 2 scheduler workers, but as far as I
  understand this should not matter: both vms should fit on the target
  host, which means this is likely a bug in the compute service and not
  in the scheduler.
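
  To illustrate why a serialization failure on the compute side produces
  exactly this error, here is a toy model (not nova code; every name in
  it is made up): two concurrent "claims" guarded by a lock that lets
  every caller through both read the same free CPU set, both choose CPU
  0, and the loser fails the subset check:

  import eventlet

  free_cpus = {0, 1}
  errors = []


  def claim(name, lock):
      with lock:
          chosen = min(free_cpus)  # both claims see {0, 1} if unserialized
          eventlet.sleep(0)        # yield point, as real claims do I/O
          if chosen not in free_cpus:
              errors.append('%s: CPU set to pin [%d] must be a subset of '
                            'free CPU set %s' % (name, chosen,
                                                 sorted(free_cpus)))
              return
          free_cpus.remove(chosen)


  class BrokenLock:
      """Stands in for a fair lock that fails to serialize spawn_n callers."""

      def __enter__(self):
          return self

      def __exit__(self, *exc):
          return False


  for vm in ('vm1', 'vm2'):
      eventlet.spawn_n(claim, vm, BrokenLock())
  eventlet.sleep(0.1)
  print(errors)  # one of the two claims fails, as seen in this bug

  With a lock that actually serializes the two claims, the second claim
  would find CPU 0 already pinned before choosing, pick CPU 1 instead,
  and both evacuations would succeed.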

  Let me know if I can collect more information.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988311/+subscriptions
