← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1293641] Re: Libvirt's close callback causes deadlocks

 

** Changed in: nova
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1293641

Title:
  Libvirt's close callback causes deadlocks

Status in OpenStack Compute (Nova):
  Fix Released

Bug description:
  Libvirt's close callback causes deadlocks

  Unlike libvirt's lifecycle event callback which is triggered every
  time an event occurs, the close callback is only triggered when an
  attempt is made to use a connection that has been closed.  In that
  case, the sequence of events is usually as follows:

  LibvirtDriver._get_connection acquires _wrapped_conn_lock
  LibvirtDriver._get_connection calls _test_connection
  LibvirtDriver._test_connection calls libvirt.getLibVersion
  libvirt.getLibVersion triggers LibvirtDriver._close_callback (because the connection is closed and this method is the registered handler)
  LibvirtDriver._close_callback attempts to acquire _wrapped_conn_lock

  _get_connection cannot release the lock because it is waiting for
  _close_callback to return and the latter cannot complete until it has
  acquired the lock.

  Making the handling of the close callback asynchronous (like the
  lifecycle event handling) won't work, because by the time the lock is
  released, the connection object that was passed into the callback will
  no longer be equal to LibvirtDriver._wrapped_conn.  Even if the
  connection object is ignored, the instance will have already been
  disabled via the _get_new_connection method's existing error-handling
  logic.

  The best solution would appear to be to simply not register a close
  callback.  The only case where it might provide some benefit is when a
  connection is closed after _get_connection has returned a reference to
  it.  The benefit of disabling the instance a little earlier in such
  marginal cases is arguably outweighed by the complexity of a thread-
  safe implementation, especially when the difficulty of testing such an
  implementation (to ensure it is indeed thread safe) is taken into
  consideration.

  Note that having _close_callback use _wrapped_conn_lock.acquire(False)
  instead of "with _wrapped_conn_lock" by itself is not a viable
  solution, because due to connections being opened via
  tpool.proxy_call, the close callback is called by a native thread,
  which means it should not be used to perform the various operations
  (including logging) involved in disabling the instance.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1293641/+subscriptions


References