yahoo-eng-team team mailing list archive

[Bug 1607461] Re: nova-compute hangs while executing a blocking call to librbd

 

Reviewed:  https://review.openstack.org/348492
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3405a28688eacbca23cf5cac0a611d33fb1a1f2c
Submitter: Jenkins
Branch:    master

commit 3405a28688eacbca23cf5cac0a611d33fb1a1f2c
Author: Roman Podoliaka <rpodolyaka@xxxxxxxxxxxx>
Date:   Thu Jul 28 20:08:44 2016 +0300

    rbd_utils: wrap blocking calls in tpool.Proxy()
    
    librbd is a Python binding around a C library, which is not aware of
    eventlet - every call to a function from this library blocks the
    whole nova-compute process for the duration of the call. To make sure
    nova-compute remains responsive we need to wrap all the calls in the
    tpool.Proxy() eventlet helper, which switches the execution context
    back to the event loop while the call is executed in a native OS
    thread from a pool.
    
    Prefer tpool.Proxy() to tpool.execute() here as the former allows for
    wrapping objects and automatically executes all the method calls in
    native OS threads, while the latter needs to be applied to each
    method call in the code repeatedly.
    
    Existing calls are modified for the sake of consistency.
    
    Closes-Bug: #1607461
    
    Change-Id: I743ab372332eb656258a476ae91f5e8fd2cbdc99


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1607461

Title:
  nova-compute hangs while executing a blocking call to librbd

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  While executing a call to librbd, nova-compute may hang for a while
  (at least some calls can take a really long time depending on the
  health of the Ceph cluster, and things like
  http://docs.ceph.com/docs/master/rbd/librbdpy/#rbd.RBD.list
  inherently slow down as the number of entities to be listed grows) and
  eventually show as down in nova service-list output.

  strace shows that the process is stuck acquiring a mutex:

  root@node-153:~# strace -p 16675
  Process 16675 attached
  futex(0x7fff084ce36c, FUTEX_WAIT_PRIVATE, 1, NULL

  gdb shows the traceback:

  http://paste.openstack.org/show/542534/

  ^ which basically means calls to librbd (a C library) are not
  monkey-patched and do not allow switching the execution context to
  another green thread in an eventlet-based process.

  To avoid blocking the whole nova-compute process on calls to librbd,
  we should wrap them with tpool.execute()
  (http://eventlet.net/doc/threading.html#eventlet.tpool.execute)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1607461/+subscriptions

