yahoo-eng-team team mailing list archive
Message #95164
[Bug 2091033] Re: Un-proxied libvirt calls list(All)Devices() can cause nova-compute to freeze for hours
Reviewed: https://review.opendev.org/c/openstack/nova/+/932669
Committed: https://opendev.org/openstack/nova/commit/0fccb365ddea10b9d6d082c3d95dba24b7fec435
Submitter: "Zuul (22348)"
Branch: master
commit 0fccb365ddea10b9d6d082c3d95dba24b7fec435
Author: melanie witt <melwittt@xxxxxxxxx>
Date: Fri Oct 18 02:54:02 2024 +0000
libvirt: Wrap un-proxied listDevices() and listAllDevices()
This is similar to change I668643c836d46a25df46d4c99a973af5e50a39db
where the objects returned in a list from a libvirt call were not
tpool.Proxy wrapped. Because the objects are not wrapped, calling
methods on them such as listCaps() can block all other greenthreads
and can cause nova-compute to freeze for hours in certain scenarios.
This adds the same wrapping to libvirt calls which return lists of
virNodeDevice.
Closes-Bug: #2091033
Change-Id: I4e423bac26990fc3538e840990294b178c30e374
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091033
Title:
Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
freeze for hours
Status in OpenStack Compute (nova):
Fix Released
Bug description:
tl;dr This bug has the same root cause as
https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
returned from libvirt are not automatically wrapped in a tpool.Proxy.
This was discovered while investigating a downstream bug [1] in which
memory was being dirtied faster than the live migration could transfer
it, and nova-compute froze for hours, unable to perform any other
operation, not even logging.
The freezing was tracked down to the un-proxied libvirt call
listAllDevices() which could block all other greenthreads. The
listAllDevices() call occurs during the update_available_resource()
periodic task in the libvirt driver in _get_pci_passthrough_devices().
In a GMR collected during a repro of the issue, a traceback showing
this was present in the report [2]:
stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
stderr F `task(self, context)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
stderr F `startup=startup)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
stderr F `startup=startup)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
stderr F `resources = self.driver.get_available_resource(nodename)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
stderr F `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
stderr F `in devices.items() if "pci" in dev.listCaps()]`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
stderr F `in devices.items() if "pci" in dev.listCaps()]`
stderr F
stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
stderr F `ret = libvirtmod.virNodeDeviceListCaps(self._o)`
The listAllDevices() function returned a list of unwrapped
virNodeDevice objects and so calling listCaps() on such an unwrapped
device could cause a freeze.
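The shallow-wrapping behavior can be illustrated with a small stdlib-only
sketch. Everything in it is a hypothetical stand-in: ThreadProxy is a
simplified proxy written for this example (not eventlet's tpool.Proxy), and
FakeConnection and FakeNodeDevice merely mimic the libvirt method names
mentioned above. It only records which thread each call ran in.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins; not the real libvirt or eventlet classes.
_POOL = ThreadPoolExecutor(max_workers=1, thread_name_prefix="native")


class FakeNodeDevice:
    """Stand-in for virNodeDevice; reports which thread ran listCaps()."""

    def listCaps(self):
        return threading.current_thread().name


class FakeConnection:
    """Stand-in for a libvirt connection; remembers the calling thread."""

    called_in = None

    def listAllDevices(self):
        self.called_in = threading.current_thread().name
        return [FakeNodeDevice(), FakeNodeDevice()]


class ThreadProxy:
    """Simplified proxy that runs method calls in a worker thread.

    It does not look inside returned lists, so list items come back
    unwrapped. Unlike a greenthread-aware proxy, result() below simply
    blocks the caller; the point here is only where the call executes.
    """

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            return _POOL.submit(attr, *args, **kwargs).result()

        return call


conn = ThreadProxy(FakeConnection())
devices = conn.listAllDevices()   # dispatched to the worker thread
conn_thread = conn.called_in
# The list items are bare objects, so these calls run in the caller's thread.
item_threads = [dev.listCaps() for dev in devices]
```

In this sketch the call on the proxied connection runs in the worker thread,
while listCaps() on each returned item runs in the calling thread, which is
the situation that lets a slow native call stall everything else.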
Based on the above, the bug reporter was able to test a patch [3] to
wrap listAllDevices() list items in tpool.Proxy and the result showed
nova-compute no longer freezing [4] in the aforementioned scenario.
During investigation it was also noticed that the list items returned
by listDevices() were not tpool.Proxy wrapped either, so this is fixed
in the same patch.
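A minimal sketch of the wrapping pattern, applied to the "pci" capability
filter seen in the traceback, might look like the following. It is not the
actual nova patch: pci_devices_by_name, the proxy_cls parameter, the fake
classes and the device names are all hypothetical, introduced so the example
runs without libvirt or eventlet. In nova the proxy class would presumably be
eventlet's tpool.Proxy.

```python
def pci_devices_by_name(conn, proxy_cls):
    """Map device name -> wrapped device for devices with a "pci" capability.

    proxy_cls is an assumption of this sketch; any callable that wraps a
    single object (for example a thread-pool proxy class) can be passed.
    """
    # Wrap every list item before calling any method on it.
    wrapped = [proxy_cls(dev) for dev in conn.listAllDevices()]
    return {dev.name(): dev for dev in wrapped if "pci" in dev.listCaps()}


# Hypothetical fakes so the sketch is self-contained.
class FakeNodeDevice:
    def __init__(self, name, caps):
        self._name = name
        self._caps = caps

    def name(self):
        return self._name

    def listCaps(self):
        return list(self._caps)


class FakeConnection:
    def listAllDevices(self):
        return [
            FakeNodeDevice("pci_0000_00_1f_2", ["pci"]),
            FakeNodeDevice("net_eth0", ["net"]),
        ]


class CountingProxy:
    """Pass-through proxy that counts how many objects were wrapped."""

    wrapped_count = 0

    def __init__(self, target):
        type(self).wrapped_count += 1
        self._target = target

    def __getattr__(self, name):
        return getattr(self._target, name)


pci = pci_devices_by_name(FakeConnection(), CountingProxy)
```

Wrapping happens before the filter, so both the listCaps() check and any
later method call on the returned devices go through the proxy.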
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
[3] https://review.opendev.org/c/openstack/nova/+/932669
[4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21