yahoo-eng-team team mailing list archive
Message #11717
[Bug 1293641] [NEW] Libvirt's close callback causes deadlocks
Public bug reported:
Libvirt's close callback causes deadlocks
Unlike libvirt's lifecycle event callback, which is triggered every time
an event occurs, the close callback is triggered only when an attempt is
made to use a connection that has already been closed. In that case, the
sequence of events is usually as follows:
1. LibvirtDriver._get_connection acquires _wrapped_conn_lock.
2. LibvirtDriver._get_connection calls _test_connection.
3. LibvirtDriver._test_connection calls libvirt.getLibVersion.
4. libvirt.getLibVersion triggers LibvirtDriver._close_callback (because the connection is closed and this method is the registered handler).
5. LibvirtDriver._close_callback attempts to acquire _wrapped_conn_lock.
The result is a deadlock: _get_connection cannot release the lock
because it is waiting for _close_callback to return, and _close_callback
cannot return until it has acquired the lock.
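A minimal sketch of the pattern (hypothetical stand-ins that mirror the
shape of the driver code, not the actual implementation):

    import threading

    _wrapped_conn_lock = threading.Lock()

    def _close_callback(conn):
        # Blocks forever: the caller still holds the lock.
        with _wrapped_conn_lock:
            print("connection closed: %s" % conn)

    def _test_connection(conn):
        # Stands in for calling getLibVersion on a dead connection,
        # which fires the registered close callback synchronously.
        _close_callback(conn)

    def _get_connection():
        with _wrapped_conn_lock:
            _test_connection("conn")  # deadlocks here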
Making the handling of the close callback asynchronous (like the
lifecycle event handling) won't work, because by the time the lock is
released, the connection object that was passed into the callback will
no longer be equal to LibvirtDriver._wrapped_conn. Even if the
connection object is ignored, the instance will have already been
disabled via the _get_new_connection method's existing error-handling
logic.
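For illustration, an asynchronous handler would look roughly like the
following (an assumed structure with hypothetical names, modelled on the
lifecycle event pattern); the identity check it would need can never
succeed once the connection has been replaced:

    import queue
    import threading

    class Driver(object):
        def __init__(self):
            self._wrapped_conn = object()
            self._wrapped_conn_lock = threading.Lock()
            self._close_events = queue.Queue()

        def _close_callback(self, conn):
            # Just queue the event instead of taking the lock.
            self._close_events.put(conn)

        def _dispatch_close_events(self):
            conn = self._close_events.get()
            with self._wrapped_conn_lock:
                # By the time the lock is free, _get_new_connection has
                # already replaced the wrapped connection, so this test
                # is always False and the event can only be dropped.
                if conn is self._wrapped_conn:
                    pass  # disable the instance here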
The best solution would appear to be to simply not register a close
callback. The only case where it might provide some benefit is when a
connection is closed after _get_connection has returned a reference to
it. The benefit of disabling the instance a little earlier in such
marginal cases is arguably outweighed by the complexity of a thread-safe
implementation, especially when the difficulty of testing such an
implementation (to ensure it is indeed thread safe) is taken into
consideration.
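Note that dead connections are still detected on the next use even
without a close callback, via the existing probe in _test_connection; a
minimal sketch of that check (assuming the libvirt-python binding):

    import libvirt

    def _test_connection(conn):
        try:
            conn.getLibVersion()
            return True
        except libvirt.libvirtError:
            # The connection is unusable; _get_connection can simply
            # open a new one.
            return False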
Note that having _close_callback use _wrapped_conn_lock.acquire(False)
instead of "with _wrapped_conn_lock" is not, by itself, a viable
solution: because connections are opened via tpool.proxy_call, the close
callback is invoked by a native thread, which should not be used to
perform the various operations (including logging) involved in disabling
the instance.
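For completeness, the rejected non-blocking variant would look roughly
like this (a hypothetical sketch; the handler registered via
registerCloseCallback receives conn, reason and opaque). acquire(False)
sidesteps the deadlock, but everything in the body still runs on a
native thread:

    import threading

    _wrapped_conn_lock = threading.Lock()

    def _close_callback(conn, reason, opaque):
        if not _wrapped_conn_lock.acquire(False):
            # Lock held by _get_connection: skip rather than deadlock.
            return
        try:
            # Still unsafe: this runs on a native libvirt thread, so
            # green-thread-only operations (logging through eventlet-
            # patched modules, disabling the service) don't belong here.
            pass
        finally:
            _wrapped_conn_lock.release()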
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1293641
Title:
Libvirt's close callback causes deadlocks
Status in OpenStack Compute (Nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1293641/+subscriptions