yahoo-eng-team team mailing list archive
Message #11717
[Bug 1293641] [NEW] Libvirt's close callback causes deadlocks
Public bug reported:
Libvirt's close callback causes deadlocks
Unlike libvirt's lifecycle event callback, which is triggered every time
an event occurs, the close callback is triggered only when an attempt is
made to use a connection that has already been closed. In that case, the
sequence of events is usually as follows:
1. LibvirtDriver._get_connection acquires _wrapped_conn_lock.
2. LibvirtDriver._get_connection calls _test_connection.
3. LibvirtDriver._test_connection calls libvirt.getLibVersion.
4. libvirt.getLibVersion triggers LibvirtDriver._close_callback (because the connection is closed and this method is the registered handler).
5. LibvirtDriver._close_callback attempts to acquire _wrapped_conn_lock.
The result is a deadlock: _get_connection cannot release the lock
because it is waiting for _close_callback to return, and _close_callback
cannot return until it has acquired the lock.
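A minimal sketch of the pattern (hypothetical stand-ins that mirror the
shape of the driver code, not the actual implementation):

    import threading

    _wrapped_conn_lock = threading.Lock()

    def _close_callback(conn):
        # Blocks forever: the caller still holds the lock.
        with _wrapped_conn_lock:
            print("connection closed: %s" % conn)

    def _test_connection(conn):
        # Stands in for calling getLibVersion on a dead connection,
        # which fires the registered close callback synchronously.
        _close_callback(conn)

    def _get_connection():
        with _wrapped_conn_lock:
            _test_connection("conn")  # deadlocks here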
Making the handling of the close callback asynchronous (like the
lifecycle event handling) won't work, because by the time the lock is
released, the connection object that was passed into the callback will
no longer be equal to LibvirtDriver._wrapped_conn. Even if the
connection object is ignored, the instance will have already been
disabled via the _get_new_connection method's existing error-handling
logic.
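For illustration, an asynchronous handler would look roughly like the
following (an assumed structure with hypothetical names, modelled on the
lifecycle event pattern); the identity check it would need can never
succeed once the connection has been replaced:

    import queue
    import threading

    class Driver(object):
        def __init__(self):
            self._wrapped_conn = object()
            self._wrapped_conn_lock = threading.Lock()
            self._close_events = queue.Queue()

        def _close_callback(self, conn):
            # Just queue the event instead of taking the lock.
            self._close_events.put(conn)

        def _dispatch_close_events(self):
            conn = self._close_events.get()
            with self._wrapped_conn_lock:
                # By the time the lock is free, _get_new_connection has
                # already replaced the wrapped connection, so this test
                # is always False and the event can only be dropped.
                if conn is self._wrapped_conn:
                    pass  # disable the instance here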
The best solution would appear to be to simply not register a close
callback. The only case where it might provide some benefit is when a
connection is closed after _get_connection has returned a reference to
it. The benefit of disabling the instance a little earlier in such
marginal cases is arguably outweighed by the complexity of a thread-safe
implementation, especially when the difficulty of testing such an
implementation (to ensure it is indeed thread safe) is taken into
consideration.
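Note that dead connections are still detected on the next use even
without a close callback, via the existing probe in _test_connection; a
minimal sketch of that check (assuming the libvirt-python binding):

    import libvirt

    def _test_connection(conn):
        try:
            conn.getLibVersion()
            return True
        except libvirt.libvirtError:
            # The connection is unusable; _get_connection can simply
            # open a new one.
            return False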
Note that having _close_callback use _wrapped_conn_lock.acquire(False)
instead of "with _wrapped_conn_lock" is not, by itself, a viable
solution: because connections are opened via tpool.proxy_call, the close
callback is invoked by a native thread, which should not be used to
perform the various operations (including logging) involved in disabling
the instance.
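For completeness, the rejected non-blocking variant would look roughly
like this (a hypothetical sketch; the handler registered via
registerCloseCallback receives conn, reason and opaque). acquire(False)
sidesteps the deadlock, but everything in the body still runs on a
native thread:

    import threading

    _wrapped_conn_lock = threading.Lock()

    def _close_callback(conn, reason, opaque):
        if not _wrapped_conn_lock.acquire(False):
            # Lock held by _get_connection: skip rather than deadlock.
            return
        try:
            # Still unsafe: this runs on a native libvirt thread, so
            # green-thread-only operations (logging through eventlet-
            # patched modules, disabling the service) don't belong here.
            pass
        finally:
            _wrapped_conn_lock.release()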
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1293641
Title:
Libvirt's close callback causes deadlocks
Status in OpenStack Compute (Nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1293641/+subscriptions