yahoo-eng-team team mailing list archive

[Bug 1029727] Re: Using logging module from signal handler can wedge process

 

** Changed in: glance
       Status: Fix Committed => Fix Released

** Changed in: glance
    Milestone: None => grizzly-rc1

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1029727

Title:
  Using logging module from signal handler can wedge process

Status in OpenStack Image Registry and Delivery Service (Glance):
  Fix Released
Status in OpenStack Compute (Nova):
  Fix Released

Bug description:
  Running into an issue with nova-api which may be threading related.
  Having lots of fun trying to reproduce and debug, but not having much
  luck yet.  Filing here to track my progress and hopefully solicit
  input from others.

  First some background... We're triggering multi-node test deployments
  for upstream commits using Juju + MAAS.  The process of deploying
  services with Juju is very much event driven.  A service is deployed,
  and relations are added to other services.  Each new relation added
  fires a number of hook scripts on the machine that do various bits of
  configuration.  For most of the nova components, this involves making
  a few modifications to nova.conf or paste config and restarting the
  running services (using upstart).

  In the case of nova-api, the service is deployed to an empty machine,
  the packages install and the services are stopped.  Then a number of
  relations are added to the other services in the environment
  (keystone, rabbit, mysql, glance, etc).  The nova-api configuration
  flow usually looks something like:

  deploy -> install nova-api, stop service
  mysql relation added  -> set sql_connection, restart nova-api
  rabbit relation added -> update nova.conf's rabbit settings, restart nova-api
  keystone relation added -> update paste.ini with credentials, restart nova-api
  etc, etc, etc

  This all happens in rapid succession. The hooks themselves are not
  executed concurrently, but since nova doesn't fork and upstart
  launches it in the foreground, service restarts are basically
  asynchronous. Depending on how quickly the hooks fire off, the service
  may be stopped and started (restarted) while the previous process is
  still going through its initialization.  So it's likely that every
  time the service is restarted, it's in a different stage of init, with
  different configuration.

  It seems that restarting nova-api in this manner is not reliable.
  It's not 100% reproducible, but calls to stop nova-api will sometimes
  leave an orphaned child, still bound to the port it was listening on
  (but unresponsive to connections, sitting in epoll_wait()).  AFAICS,
  the parent process is mistakenly getting killed off before its
  children.  At this point, nova-api is stopped as far as upstart is
  concerned.  The following hook will try to start nova-api and the
  service will fail because the port is already in use.  The only way to
  recover is to manually kill the orphaned child and restart the
  service.
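
  A quick way to confirm this state is to look for a listener whose
  parent has already exited.  A minimal sketch, assuming the psutil
  library is installed and the script runs as root (so other processes'
  sockets are visible); 8774 is only the usual nova-api port and may
  differ in a given deployment:

    import psutil

    API_PORT = 8774  # hypothetical: use whatever port nova-api binds

    for conn in psutil.net_connections(kind='tcp'):
        if conn.status == psutil.CONN_LISTEN and conn.laddr[1] == API_PORT:
            proc = psutil.Process(conn.pid)
            # ppid of 1 means the process has been re-parented to init.
            print(conn.pid, proc.name(), 'ppid=%d' % proc.ppid())

  A listener reporting ppid 1 has been orphaned, i.e. the parent exited
  before its children, which matches what upstart then reports as
  "stopped".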

  I *was* seeing the following traceback in the api log whenever it hit
  this state, but these have disappeared from the logs over the last day
  or two, though the problem still persists.

  2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
  2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
  2012-07-25 13:07:20 INFO nova.service [-] Child 6519 exited with status 0
  2012-07-25 13:07:20 INFO nova.service [-] Child 6518 exited with status 0
  2012-07-25 13:07:20 INFO nova.service [-] Child 6517 exited with status 0
  2012-07-25 13:07:20 ERROR nova.service [-] Unhandled exception
  2012-07-25 13:07:20 TRACE nova.service Traceback (most recent call last):
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 289, in _start_child
  2012-07-25 13:07:20 TRACE nova.service     self._child_process(wrap.server)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 266, in _child_process
  2012-07-25 13:07:20 TRACE nova.service     launcher.run_server(server)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 127, in run_server
  2012-07-25 13:07:20 TRACE nova.service     server.wait()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 619, in wait
  2012-07-25 13:07:20 TRACE nova.service     self.server.wait()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/nova/wsgi.py", line 113, in wait
  2012-07-25 13:07:20 TRACE nova.service     self._server.wait()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
  2012-07-25 13:07:20 TRACE nova.service     return self._exit_event.wait()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
  2012-07-25 13:07:20 TRACE nova.service     return hubs.get_hub().switch()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
  2012-07-25 13:07:20 TRACE nova.service     return self.greenlet.switch()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 226, in run
  2012-07-25 13:07:20 TRACE nova.service     self.wait(sleep_time)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 84, in wait
  2012-07-25 13:07:20 TRACE nova.service     presult = self.do_poll(seconds)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 245, in _sigterm
  2012-07-25 13:07:20 TRACE nova.service     LOG.info(_('Received SIGTERM, stopping'))
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 1420, in info
  2012-07-25 13:07:20 TRACE nova.service     self.logger.info(msg, *args, **kwargs)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 1140, in info
  2012-07-25 13:07:20 TRACE nova.service     self._log(INFO, msg, args, **kwargs)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 1258, in _log
  2012-07-25 13:07:20 TRACE nova.service     self.handle(record)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 1268, in handle
  2012-07-25 13:07:20 TRACE nova.service     self.callHandlers(record)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 1308, in callHandlers
  2012-07-25 13:07:20 TRACE nova.service     hdlr.handle(record)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 746, in handle
  2012-07-25 13:07:20 TRACE nova.service     self.acquire()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/logging/__init__.py", line 697, in acquire
  2012-07-25 13:07:20 TRACE nova.service     self.lock.acquire()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/threading.py", line 127, in acquire
  2012-07-25 13:07:20 TRACE nova.service     rc = self.__block.acquire(blocking)
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/semaphore.py", line 71, in acquire
  2012-07-25 13:07:20 TRACE nova.service     hubs.get_hub().switch()
  2012-07-25 13:07:20 TRACE nova.service   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 162, in switch
  2012-07-25 13:07:20 TRACE nova.service     assert cur is not self.greenlet, 'Cannot switch to MAINLOOP from MAINLOOP'
  2012-07-25 13:07:20 TRACE nova.service AssertionError: Cannot switch to MAINLOOP from MAINLOOP
  2012-07-25 13:07:20 TRACE nova.service
  2012-07-25 13:07:22 INFO nova.service [-] Parent process has died unexpectedly, exiting
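
  The bottom frames of that trace show the actual hazard: the SIGTERM
  handler in nova/service.py calls LOG.info(), logging then tries to
  acquire its handler lock, and because eventlet has monkey-patched
  threading the acquire attempts a greenlet switch from inside the hub
  itself, hence "Cannot switch to MAINLOOP from MAINLOOP".  A minimal
  sketch of the usual way to avoid this class of problem (not
  necessarily the exact fix that landed for this bug) is to keep the
  handler to async-signal-safe work and defer logging and shutdown to
  normal context:

    import signal

    _sigterm_received = False

    def _handle_sigterm(signum, frame):
        # The handler interrupts whatever the process was doing,
        # possibly an in-progress logging call or the eventlet hub.
        # Calling LOG.info() here can try to re-take a lock the
        # interrupted code already holds, wedging the process or
        # raising the MAINLOOP assertion seen above.  Only record
        # that the signal arrived.
        global _sigterm_received
        _sigterm_received = True

    signal.signal(signal.SIGTERM, _handle_sigterm)

    # Elsewhere, in the ordinary service loop (illustrative, not the
    # real nova code), it is safe to log and stop:
    #     if _sigterm_received:
    #         LOG.info('Received SIGTERM, stopping')
    #         server.stop()

  The same effect can be had with signal.set_wakeup_fd() or a
  self-pipe, so the main loop wakes up promptly instead of polling the
  flag.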

  I can't seem to reproduce manually, but I am in the process of adding
  debugging capabilities to the systems during provisioning, so I should
  be able to get gdb attached to one of the stuck processes tomorrow.  Any
  clues in the meantime would be appreciated.

  I partly want to migrate to using individual API servers instead of
  the monolithic nova-api, but if nova-api remains it should be able to
  handle restarts robustly.

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1029727/+subscriptions