
yahoo-eng-team team mailing list archive

[Bug 2091033] Re: Un-proxied libvirt calls list(All)Devices() can cause nova-compute to freeze for hours

 

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: nova (Ubuntu)
   Importance: Undecided
       Status: New

** Description changed:

+ [Impact]
+ 
+ Nova uses eventlet.tpool.Proxy to run blocking libvirt calls in native
+ threads so that they do not starve other greenthreads. This patch fixes
+ the case where virNodeDevice objects returned from libvirt were not
+ wrapped by the proxy, so calls on them executed outside the thread pool
+ and starved the other greenthreads. Two patches are required to fix
+ this issue: the one in this bug, and a second that fixes a regression
+ subsequently identified by the first patch (bug 2098892).
+ 
+ [Test Plan]
+ 
+  * Deploy OpenStack Yoga with SR-IOV enabled. Create and delete many VMs over a period of several hours, if not days.
+  * Ensure that the time nova.compute.resource_tracker takes to run does not continuously increase (https://github.com/dosaboy/openstack-analysis can be used to determine this).
+ 
+ [Regression Potential]
+ 
+  * no regression potential is expected as a result of this set of
+ patches.
+ 
+ --------------------------------------------------------------------------
+ 
  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.
  
  This was discovered while investigating a downstream bug [1] in which a
  guest was dirtying memory faster than the live migration could transfer
  it and nova-compute froze for hours, unable to perform any other
  operation, not even logging.
  
  The freeze was tracked down to the un-proxied libvirt call
  listAllDevices(), which could block all other greenthreads. The call
  occurs during the update_available_resource() periodic task, in
  _get_pci_passthrough_devices() in the libvirt driver. A Guru Meditation
  Report (GMR) collected during a repro of the issue contained a
  traceback showing this [2]:
  
  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`
  
  The listAllDevices() function returned a list of unwrapped virNodeDevice
  objects and so calling listCaps() on such an unwrapped device could
  cause a freeze.
  
  Based on the above, the bug reporter was able to test a patch [3] to
  wrap listAllDevices() list items in tpool.Proxy and the result showed
  nova-compute no longer freezing [4] in the aforementioned scenario.
  
  During investigation it was also noticed that the listDevices() call
  list items were not tpool.Proxy wrapped, so this is fixed as well in the
  patch.
  
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

** Also affects: nova (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: nova (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/caracal
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  Fix Committed
Status in OpenStack Compute (nova) 2024.2 series:
  Fix Committed
Status in OpenStack Compute (nova) antelope series:
  Fix Committed
Status in OpenStack Compute (nova) bobcat series:
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Jammy:
  New
Status in nova source package in Noble:
  New

Bug description:
  [Impact]

  Nova uses eventlet.tpool.Proxy to run blocking libvirt calls in native
  threads so that they do not starve other greenthreads. This patch fixes
  the case where virNodeDevice objects returned from libvirt were not
  wrapped by the proxy, so calls on them executed outside the thread pool
  and starved the other greenthreads. Two patches are required to fix
  this issue: the one in this bug, and a second that fixes a regression
  subsequently identified by the first patch (bug 2098892).

  [Test Plan]

   * Deploy OpenStack Yoga with SR-IOV enabled. Create and delete many VMs over a period of several hours, if not days.
   * Ensure that the time nova.compute.resource_tracker takes to run does not continuously increase (https://github.com/dosaboy/openstack-analysis can be used to determine this).

  [Regression Potential]

   * no regression potential is expected as a result of this set of
  patches.

  --------------------------------------------------------------------------

  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.

  This was discovered while investigating a downstream bug [1] in which a
  guest was dirtying memory faster than the live migration could transfer
  it and nova-compute froze for hours, unable to perform any other
  operation, not even logging.

  The freeze was tracked down to the un-proxied libvirt call
  listAllDevices(), which could block all other greenthreads. The call
  occurs during the update_available_resource() periodic task, in
  _get_pci_passthrough_devices() in the libvirt driver. A Guru Meditation
  Report (GMR) collected during a repro of the issue contained a
  traceback showing this [2]:

  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`

  The listAllDevices() function returned a list of unwrapped
  virNodeDevice objects and so calling listCaps() on such an unwrapped
  device could cause a freeze.

  Based on the above, the bug reporter was able to test a patch [3] to
  wrap listAllDevices() list items in tpool.Proxy and the result showed
  nova-compute no longer freezing [4] in the aforementioned scenario.

  During investigation it was also noticed that the listDevices() call
  list items were not tpool.Proxy wrapped, so this is fixed as well in
  the patch.
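
  As a rough illustration of why list items escape the proxy, the sketch
  below uses a small stdlib-only stand-in for eventlet.tpool.Proxy. The
  Proxy class here, FakeConnection, FakeNodeDevice and list_all_devices()
  are hypothetical names made up for this example; none of it is nova or
  eventlet code. The stand-in only mimics the relevant behaviour: a
  method's return value is wrapped only when it is itself an instance of
  an autowrap type, which a list of devices is not.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=4)


class Proxy:
    """Stand-in for eventlet.tpool.Proxy (assumption: simplified behaviour).

    Method calls on the wrapped object run in a worker thread.
    """

    def __init__(self, obj, autowrap=()):
        self._obj = obj
        self._autowrap = autowrap

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            result = _POOL.submit(attr, *args, **kwargs).result()
            # Only a return value that is itself an autowrap type is wrapped.
            # A list of such objects is returned as a plain list.
            if isinstance(result, self._autowrap):
                return Proxy(result, self._autowrap)
            return result

        return call


class FakeNodeDevice:
    """Hypothetical stand-in for libvirt's virNodeDevice."""

    def __init__(self, caps):
        self._caps = caps
        self.last_thread = None

    def listCaps(self):
        # Record which thread ran this (potentially blocking) call.
        self.last_thread = threading.get_ident()
        return list(self._caps)


class FakeConnection:
    """Hypothetical stand-in for a libvirt connection."""

    def listAllDevices(self, flags=0):
        return [FakeNodeDevice(["pci"]), FakeNodeDevice(["net"])]


def list_all_devices(conn, flags=0):
    """Wrap every list item individually so later calls are proxied too."""
    return [Proxy(dev) for dev in conn.listAllDevices(flags)]


conn = Proxy(FakeConnection(), autowrap=(FakeNodeDevice,))
unwrapped = conn.listAllDevices()   # plain list of unwrapped devices
wrapped = list_all_devices(conn)    # each item is a Proxy
```

  Calling listCaps() on an item of `unwrapped` runs in the caller's own
  thread, whereas the same call on an item of `wrapped` is handed to the
  worker pool. In nova the caller's thread is the one running all
  greenthreads, so an unwrapped call that blocks stalls everything else.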

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2091033/+subscriptions


