yahoo-eng-team team mailing list archive

[Bug 1840912] [NEW] libvirt calls aren't reliably using tpool.Proxy

 

Public bug reported:

A customer is hitting an issue with symptoms identical to bug 1045152
(from 2012). Specifically, we are frequently seeing the compute host
being marked down. From log correlation, we can see that when this
occurs the relevant compute is always in the middle of executing
LibvirtDriver._get_disk_over_committed_size_total(). The reason for this
appears to be a long-running libvirt call which is not using
tpool.Proxy, and therefore blocks all other greenthreads during
execution. We do not yet know why the libvirt call is slow, but we have
identified the reason it is not using tpool.Proxy.

Because Nova runs under eventlet, we proxy libvirt calls at the point
where we create the libvirt connection in libvirt.Host._connect:

        return tpool.proxy_call(
            (libvirt.virDomain, libvirt.virConnect),
            libvirt.openAuth, uri, auth, flags)

This means: run libvirt.openAuth(uri, auth, flags) in a native thread.
If the returned object is a libvirt.virDomain or libvirt.virConnect,
wrap the returned object in a tpool.Proxy with the same autowrap rules.
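
The behaviour described above can be modelled without eventlet or
libvirt. What follows is a simplified sketch, not eventlet's code:
FakeProxy, fake_proxy_call, FakeConnect, FakeDomain and fake_open_auth
are hypothetical stand-ins, and a concurrent.futures thread pool stands
in for eventlet's native thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for eventlet's pool of native threads (assumption for this sketch).
_pool = ThreadPoolExecutor(max_workers=4)


class FakeProxy:
    """Hypothetical stand-in for tpool.Proxy."""

    def __init__(self, obj, autowrap=()):
        self._obj = obj
        self._autowrap = autowrap

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            # Method calls go through the same rules as the initial call.
            return fake_proxy_call(self._autowrap, attr, *args, **kwargs)
        return call


def fake_proxy_call(autowrap, func, *args, **kwargs):
    # Run func in a worker thread, then wrap the result only if the
    # result itself is an instance of one of the autowrap classes.
    result = _pool.submit(func, *args, **kwargs).result()
    if isinstance(result, autowrap):
        return FakeProxy(result, autowrap)
    return result


class FakeDomain:
    """Hypothetical stand-in for libvirt.virDomain."""

    def thread_ident(self):
        return threading.get_ident()


class FakeConnect:
    """Hypothetical stand-in for libvirt.virConnect."""

    def lookupByName(self, name):
        return FakeDomain()


def fake_open_auth(uri):
    """Hypothetical stand-in for libvirt.openAuth."""
    return FakeConnect()


conn = fake_proxy_call((FakeDomain, FakeConnect),
                       fake_open_auth, "qemu:///system")
dom = conn.lookupByName("instance-00000001")

# Both objects are proxied, so the domain method runs off the main thread.
print(type(conn).__name__, type(dom).__name__)
print(dom.thread_ident() != threading.get_ident())
```

In this model the connection and any domain it returns directly are
both wrapped, because each return value passes the isinstance check on
its own.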

There are two problems with this. Firstly, the autowrap list is
incomplete. At the very least we need to add libvirt.virNodeDevice,
libvirt.virSecret, and libvirt.virNWFilter to this list, as we use all
of these objects in Nova. Currently none of our interactions with these
objects use the tpool proxy.

Secondly, and this is the specific root cause of this bug, the autowrap
check in tpool doesn't understand lists:

https://github.com/eventlet/eventlet/blob/ca8dd0748a1985a409e9a9a517690f46e05cae99/eventlet/tpool.py#L149

In LibvirtDriver._get_disk_over_committed_size_total() we get a list of
running libvirt domains with libvirt.Host.list_instance_domains, which
calls virConnect.listAllDomains(). listAllDomains() returns a *list* of
virDomain, which the above code in tpool doesn't match. Consequently,
none of the subsequent virDomain calls use the tpool proxy, so any slow
call on one of those domains starves all other greenthreads.
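
The list case can be illustrated with a small model of the type check.
This is a sketch under assumptions: FakeDomain and FakeProxy are
hypothetical stand-ins for libvirt.virDomain and tpool.Proxy, and
autowrap_each shows one possible workaround, not necessarily the fix
Nova will adopt.

```python
class FakeDomain:
    """Hypothetical stand-in for libvirt.virDomain."""


class FakeProxy:
    """Hypothetical stand-in for tpool.Proxy; only records what it wraps."""

    def __init__(self, obj, autowrap=()):
        self._obj = obj
        self._autowrap = autowrap


AUTOWRAP = (FakeDomain,)


def autowrap_result(result, autowrap=AUTOWRAP):
    # Model of the type check: wrap only if the result itself matches.
    if isinstance(result, autowrap):
        return FakeProxy(result, autowrap)
    return result


def autowrap_each(results, autowrap=AUTOWRAP):
    # Possible workaround (an assumption): apply the same check to
    # every element of a returned list.
    return [autowrap_result(r, autowrap) for r in results]


# Shape of what a listAllDomains()-style call returns: a plain list.
domains = [FakeDomain(), FakeDomain()]

unwrapped = autowrap_result(domains)
wrapped = autowrap_each(domains)

# A list is not a FakeDomain, so it comes back untouched.
print(unwrapped is domains)
# Wrapping element by element proxies every domain.
print(all(isinstance(d, FakeProxy) for d in wrapped))
```

The list fails the isinstance check because it is a list rather than a
domain, so its elements are handed back unproxied unless the caller
wraps them individually.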

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1840912
