← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1412742] [NEW] tests randomly hang in oslo.messaging fake driver

 

Public bug reported:

The following change has been getting random failures in the py27 test
job where it would time out after 50 minutes.

https://review.openstack.org/#/c/140725/

eg this test run:

http://logs.openstack.org/25/140725/11/gate/gate-nova-python27/0dabb50/

Strangely other stuff going into the gate was not affected, and there's
no obvious sign of what in this change would cause such a hang probem.

Running all the changed tests in isolation never showed a hang. Only
when the entire nova test suite was run would it hang, so it seems like
some kind of race condition is hiding there. Probably the fact that this
change deletes some tests in one file and adds tests in another file
makes the race more likely to occurr.

Reproducing the hang locally and then attaching to the procss with GDB
shows the following trace that ends the in oslo.messaging fake driver

(gdb) py-bt
#3 Frame 0x8b71dd0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_fake.py, line 149, in get_exchange (self=<FakeExchangeManager(_exchanges_lock=<Semaphore(counter=0, _waiters=set([])) at remote 0x954c1d0>, _exchanges={'nova': <FakeExchange(_server_queues={('fake', 'foo'): []}, _topic_queues={('fake', None): []}, _queues_lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<Semaphore(counter=1, _waiters=set([])) at remote 0x7667f10>, _RLock__count=0) at remote 0x7667a10>, name='nova') at remote 0x7667910>}, _default_exchange='nova') at remote 0x954c850>, name='nova')
    return self._exchanges.setdefault(name, FakeExchange(name))
#6 Frame 0xf2281d0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_fake.py, line 63, in poll (self=<FakeListener(_pool=None, _targets=[<Target(version=None, exchange='nova', namespace=None, server='foo', topic='fake', fanout=None) at remote 0x7667cd0>, <Target(version=None, exchange='nova', namespace=None, server=None, topic='fake', fanout=None) at remote 0x7667790>], driver=<FakeDriver(_allowed_remote_exmods=['nova.exception', 'nova.test'], _exchange_manager=<FakeExchangeManager(_exchanges_lock=<Semaphore(counter=0, _waiters=set([])) at remote 0x954c1d0>, _exchanges={'nova': <FakeExchange(_server_queues={('fake', 'foo'): []}, _topic_queues={('fake', None): []}, _queues_lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<Semaphore(counter=1, _waiters=set([])) at remote 0x7667f10>, _RLock__count=0) at remote 0x7667a10>, name='nova') at remote 0x7667910>}, _default_exchange='nova') at remote 0x954c850>, _url=<TransportURL(hosts=[], _tr...(truncated)
    exchange = self._exchange_manager.get_exchange(target.exchange)
#10 Frame 0x710e9d0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo/messaging/_executors/impl_eventlet.py, line 88, in _executor_thread (incoming=None)
    incoming = self.listener.poll(timeout=base.POLL_TIMEOUT)
#15 Frame 0x8204da0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo_utils/excutils.py, line 92, in inner_func (args=(), kwargs={}, last_log_time=1421749084, last_exc_message=u'', exc_count=0, exc=<TimeoutException at remote 0x10824a50>, this_exc_message=u'', cur_time=1421749084)
    return infunc(*args, **kwargs)
#20 Frame 0x102669a0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/greenthread.py, line 214, in main (self=<GreenThread(_exit_funcs=<collections.deque at remote 0x10608050>, _exit_event=<Event(_result=<NOT_USED() at remote 0x23f6c20>, _waiters=set([]), _exc=None) at remote 0x7667150>, _resolving_links=False) at remote 0xe1d6eb0>, function=<function at remote 0xe47f8c0>, args=(), kwargs={})
    result = function(*args, **kwargs)
#31 Frame 0x84deca0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/hubs/timer.py, line 58, in __call__ (self=<Timer(seconds=<float at remote 0xbf93860>, tpl=(<built-in method switch of GreenThread object at remote 0xe1d6eb0>, (), {}), called=True) at remote 0xcedeb90>, args=(...), cb=<built-in method switch of GreenThread object at remote 0xe1d6eb0>, kw={...})
    cb(*args, **kw)
#41 Frame 0xf1e5ff0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/hubs/hub.py, line 457, in fire_timers (self=<Hub(next_timers=[], clock=<built-in function time>, debug_exceptions=True, debug_blocking_resolution=1, modify=<built-in method modify of select.epoll object at remote 0xe80eca8>, running=True, debug_blocking=False, listeners={'read': {11: <FdListener(fileno=11, cb=<built-in method switch of GreenThread object at remote 0xe1d6c30>, spent=False, greenlet=<GreenThread(_exit_funcs=<collections.deque at remote 0xe1a5210>, _exit_event=<Event(_result=<NOT_USED() at remote 0x23f6c20>, _waiters=set([]), _exc=None) at remote 0x10938950>, _resolving_links=False) at remote 0xe1d6c30>, evtype='read', mark_as_closed=<instancemethod at remote 0xdde30a0>, tb=<built-in method throw of GreenThread object at remote 0xe1d6c30>) at remote 0x10938cd0>, 13: <FdListener(fileno=13, cb=<built-in method switch of GreenThread object at remote 0xef42c30>, spent=False, greenlet=<GreenThread(_...(truncated)
    timer()
#44 Frame 0xd7d4970, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/hubs/hub.py, line 336, in run (self=<Hub(next_timers=[], clock=<built-in function time>, debug_exceptions=True, debug_blocking_resolution=1, modify=<built-in method modify of select.epoll object at remote 0xe80eca8>, running=True, debug_blocking=False, listeners={'read': {11: <FdListener(fileno=11, cb=<built-in method switch of GreenThread object at remote 0xe1d6c30>, spent=False, greenlet=<GreenThread(_exit_funcs=<collections.deque at remote 0xe1a5210>, _exit_event=<Event(_result=<NOT_USED() at remote 0x23f6c20>, _waiters=set([]), _exc=None) at remote 0x10938950>, _resolving_links=False) at remote 0xe1d6c30>, evtype='read', mark_as_closed=<instancemethod at remote 0xdde30a0>, tb=<built-in method throw of GreenThread object at remote 0xe1d6c30>) at remote 0x10938cd0>, 13: <FdListener(fileno=13, cb=<built-in method switch of GreenThread object at remote 0xef42c30>, spent=False, greenlet=<GreenThread(_exit_fun...(truncated)
    self.fire_timers(self.clock())


Upon Ctrl-C'ing the hung test suite it shows it was trying to run this test case at the time:

^C{1} nova.tests.unit.virt.libvirt.test_volume.LibvirtVolumeTestCase.test_libvirt_kvm_iser_volume_with_multipath_getmpdev [] ... inprogress
Traceback (most recent call last):
  File "/home/berrange/src/cloud/nova/.tox/py27/bin/subunit-trace", line 11, in <module>
    sys.exit(main())
  File "/home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/tempest_lib/cmd/subunit_trace.py", line 257, in main
    stream.run(result)
  File "/home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/subunit/v2.py", line 270, in run
    content = self.source.read(1)
KeyboardInterrupt
ERROR: KEYBOARDINTERRUPT
interrupted
ERROR: keyboardinterrupt


Again, running that test case in isolation never fails.


Seeing this recent commit

commit 2809fab96c4b48f2a979965b3f4788dfb1379977
Author: Sean Dague <sean@xxxxxxxxx>
Date:   Fri Jan 16 14:41:24 2015 -0500

    increase fake rpc POLL_TIMEOUT to 0.1s
    
    The 0.001s POLL_TIMEOUT appeared related to an intermitent race
    issue. The polling loop in the fake driver in oslo.messaging is 0.05s,
    so increase to 0.1 total time out to make sure we get at least one
    full polling cycle on waits.
    
    Change-Id: I24a83638df0c8f3f84bb528333a91491a98016a3

suggests to me that the problem could be related to the fact that the
nova.tests.unit.virt.libvirt.test_volume.LibvirtVolumeTestCase.test_libvirt_kvm_iser_volume_with_multipath_getmpdev
test stubs out  time.sleep to be a no-op

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1412742

Title:
  tests randomly hang in oslo.messaging fake driver

Status in OpenStack Compute (Nova):
  New

Bug description:
  The following change has been getting random failures in the py27 test
  job where it would time out after 50 minutes.

  https://review.openstack.org/#/c/140725/

  eg this test run:

  http://logs.openstack.org/25/140725/11/gate/gate-nova-
  python27/0dabb50/

  Strangely other stuff going into the gate was not affected, and
  there's no obvious sign of what in this change would cause such a hang
  probem.

  Running all the changed tests in isolation never showed a hang. Only
  when the entire nova test suite was run would it hang, so it seems
  like some kind of race condition is hiding there. Probably the fact
  that this change deletes some tests in one file and adds tests in
  another file makes the race more likely to occurr.

  Reproducing the hang locally and then attaching to the procss with GDB
  shows the following trace that ends the in oslo.messaging fake driver

  (gdb) py-bt
  #3 Frame 0x8b71dd0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_fake.py, line 149, in get_exchange (self=<FakeExchangeManager(_exchanges_lock=<Semaphore(counter=0, _waiters=set([])) at remote 0x954c1d0>, _exchanges={'nova': <FakeExchange(_server_queues={('fake', 'foo'): []}, _topic_queues={('fake', None): []}, _queues_lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<Semaphore(counter=1, _waiters=set([])) at remote 0x7667f10>, _RLock__count=0) at remote 0x7667a10>, name='nova') at remote 0x7667910>}, _default_exchange='nova') at remote 0x954c850>, name='nova')
      return self._exchanges.setdefault(name, FakeExchange(name))
  #6 Frame 0xf2281d0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_fake.py, line 63, in poll (self=<FakeListener(_pool=None, _targets=[<Target(version=None, exchange='nova', namespace=None, server='foo', topic='fake', fanout=None) at remote 0x7667cd0>, <Target(version=None, exchange='nova', namespace=None, server=None, topic='fake', fanout=None) at remote 0x7667790>], driver=<FakeDriver(_allowed_remote_exmods=['nova.exception', 'nova.test'], _exchange_manager=<FakeExchangeManager(_exchanges_lock=<Semaphore(counter=0, _waiters=set([])) at remote 0x954c1d0>, _exchanges={'nova': <FakeExchange(_server_queues={('fake', 'foo'): []}, _topic_queues={('fake', None): []}, _queues_lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<Semaphore(counter=1, _waiters=set([])) at remote 0x7667f10>, _RLock__count=0) at remote 0x7667a10>, name='nova') at remote 0x7667910>}, _default_exchange='nova') at remote 0x954c850>, _url=<TransportURL(hosts=[], _tr...(truncated)
      exchange = self._exchange_manager.get_exchange(target.exchange)
  #10 Frame 0x710e9d0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo/messaging/_executors/impl_eventlet.py, line 88, in _executor_thread (incoming=None)
      incoming = self.listener.poll(timeout=base.POLL_TIMEOUT)
  #15 Frame 0x8204da0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/oslo_utils/excutils.py, line 92, in inner_func (args=(), kwargs={}, last_log_time=1421749084, last_exc_message=u'', exc_count=0, exc=<TimeoutException at remote 0x10824a50>, this_exc_message=u'', cur_time=1421749084)
      return infunc(*args, **kwargs)
  #20 Frame 0x102669a0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/greenthread.py, line 214, in main (self=<GreenThread(_exit_funcs=<collections.deque at remote 0x10608050>, _exit_event=<Event(_result=<NOT_USED() at remote 0x23f6c20>, _waiters=set([]), _exc=None) at remote 0x7667150>, _resolving_links=False) at remote 0xe1d6eb0>, function=<function at remote 0xe47f8c0>, args=(), kwargs={})
      result = function(*args, **kwargs)
  #31 Frame 0x84deca0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/hubs/timer.py, line 58, in __call__ (self=<Timer(seconds=<float at remote 0xbf93860>, tpl=(<built-in method switch of GreenThread object at remote 0xe1d6eb0>, (), {}), called=True) at remote 0xcedeb90>, args=(...), cb=<built-in method switch of GreenThread object at remote 0xe1d6eb0>, kw={...})
      cb(*args, **kw)
  #41 Frame 0xf1e5ff0, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/hubs/hub.py, line 457, in fire_timers (self=<Hub(next_timers=[], clock=<built-in function time>, debug_exceptions=True, debug_blocking_resolution=1, modify=<built-in method modify of select.epoll object at remote 0xe80eca8>, running=True, debug_blocking=False, listeners={'read': {11: <FdListener(fileno=11, cb=<built-in method switch of GreenThread object at remote 0xe1d6c30>, spent=False, greenlet=<GreenThread(_exit_funcs=<collections.deque at remote 0xe1a5210>, _exit_event=<Event(_result=<NOT_USED() at remote 0x23f6c20>, _waiters=set([]), _exc=None) at remote 0x10938950>, _resolving_links=False) at remote 0xe1d6c30>, evtype='read', mark_as_closed=<instancemethod at remote 0xdde30a0>, tb=<built-in method throw of GreenThread object at remote 0xe1d6c30>) at remote 0x10938cd0>, 13: <FdListener(fileno=13, cb=<built-in method switch of GreenThread object at remote 0xef42c30>, spent=False, greenlet=<GreenThread(_...(truncated)
      timer()
  #44 Frame 0xd7d4970, for file /home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/eventlet/hubs/hub.py, line 336, in run (self=<Hub(next_timers=[], clock=<built-in function time>, debug_exceptions=True, debug_blocking_resolution=1, modify=<built-in method modify of select.epoll object at remote 0xe80eca8>, running=True, debug_blocking=False, listeners={'read': {11: <FdListener(fileno=11, cb=<built-in method switch of GreenThread object at remote 0xe1d6c30>, spent=False, greenlet=<GreenThread(_exit_funcs=<collections.deque at remote 0xe1a5210>, _exit_event=<Event(_result=<NOT_USED() at remote 0x23f6c20>, _waiters=set([]), _exc=None) at remote 0x10938950>, _resolving_links=False) at remote 0xe1d6c30>, evtype='read', mark_as_closed=<instancemethod at remote 0xdde30a0>, tb=<built-in method throw of GreenThread object at remote 0xe1d6c30>) at remote 0x10938cd0>, 13: <FdListener(fileno=13, cb=<built-in method switch of GreenThread object at remote 0xef42c30>, spent=False, greenlet=<GreenThread(_exit_fun...(truncated)
      self.fire_timers(self.clock())

  
  Upon Ctrl-C'ing the hung test suite it shows it was trying to run this test case at the time:

  ^C{1} nova.tests.unit.virt.libvirt.test_volume.LibvirtVolumeTestCase.test_libvirt_kvm_iser_volume_with_multipath_getmpdev [] ... inprogress
  Traceback (most recent call last):
    File "/home/berrange/src/cloud/nova/.tox/py27/bin/subunit-trace", line 11, in <module>
      sys.exit(main())
    File "/home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/tempest_lib/cmd/subunit_trace.py", line 257, in main
      stream.run(result)
    File "/home/berrange/src/cloud/nova/.tox/py27/lib/python2.7/site-packages/subunit/v2.py", line 270, in run
      content = self.source.read(1)
  KeyboardInterrupt
  ERROR: KEYBOARDINTERRUPT
  interrupted
  ERROR: keyboardinterrupt

  
  Again, running that test case in isolation never fails.

  
  Seeing this recent commit

  commit 2809fab96c4b48f2a979965b3f4788dfb1379977
  Author: Sean Dague <sean@xxxxxxxxx>
  Date:   Fri Jan 16 14:41:24 2015 -0500

      increase fake rpc POLL_TIMEOUT to 0.1s
      
      The 0.001s POLL_TIMEOUT appeared related to an intermitent race
      issue. The polling loop in the fake driver in oslo.messaging is 0.05s,
      so increase to 0.1 total time out to make sure we get at least one
      full polling cycle on waits.
      
      Change-Id: I24a83638df0c8f3f84bb528333a91491a98016a3

  suggests to me that the problem could be related to the fact that the
  nova.tests.unit.virt.libvirt.test_volume.LibvirtVolumeTestCase.test_libvirt_kvm_iser_volume_with_multipath_getmpdev
  test stubs out  time.sleep to be a no-op

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1412742/+subscriptions


Follow ups

References