yahoo-eng-team team mailing list archive
Message #01422
[Bug 1029727] Re: Using logging module from signal handler can wedge process
** Changed in: glance
Status: Fix Committed => Fix Released
** Changed in: glance
Milestone: None => grizzly-rc1
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1029727
Title:
Using logging module from signal handler can wedge process
Status in OpenStack Image Registry and Delivery Service (Glance):
Fix Released
Status in OpenStack Compute (Nova):
Fix Released
Bug description:
Running into an issue with nova-api which may be threading-related.
Having lots of fun trying to reproduce and debug, but not having much
luck yet. Filing here to track my progress and hopefully solicit
input from others.
First some background... We're triggering multi-node test deployments
for upstream commits using Juju + MAAS. The process of deploying
services with Juju is very much event-driven. A service is deployed,
and relations are added to other services. Each new relation added
fires a number of hook scripts on the machine that do various bits of
configuration. For most of the nova components, this involves making
a few modifications to nova.conf or paste config and restarting the
running services (using upstart).
In the case of nova-api, the service is deployed to an empty machine,
the packages are installed and the services are stopped. Then a number of
relations are added to the other services in the environment
(keystone, rabbit, mysql, glance, etc). The nova-api configuration
flow usually looks something like:
deploy -> install nova-api, stop service
mysql relation added -> set sql_connection, restart nova-api
rabbit relation added -> update nova.conf's rabbit settings, restart nova-api
keystone relation added -> update paste.ini with credentials, restart nova-api
etc, etc, etc
This all happens in rapid succession. The hooks themselves are not
executed concurrently, but since nova doesn't fork and upstart
launches it in the foreground, service restarts are basically
asynchronous. Depending on how quickly the hooks fire off, the service
may be stopped and started (restarted) while the previous process is
still going through its initialization. So, it's likely that every
time the service is restarted, it's in a different stage of init, with
different configuration.
It seems that nova-api being restarted in this manner is not reliable.
It's not 100% reproducible, but calls to stop nova-api will sometimes
leave an orphaned child, still bound to the port it was listening on
(but unresponsive to connections, sitting in epoll_wait()). AFAICS,
the parent process is mistakenly getting killed off before its
children. At this point, nova-api is stopped as far as upstart is
concerned. The following hook will try to start nova-api and the
service will fail because the port is already in use. The only way to
recover is to manually kill the orphaned child and restart the
service.
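For what it's worth, a quick way to confirm which process is still holding the listening socket after upstart thinks the service is stopped is something like the sketch below. This is purely an illustrative helper, assuming psutil is installed and assuming nova-api's default osapi_compute port 8774; adjust the port for your deployment and run it as root so other users' sockets are visible.
# Illustrative diagnostic only (assumes psutil is installed and that nova-api
# listens on 8774, the default osapi_compute port) -- not part of nova itself.
import psutil

PORT = 8774  # adjust to the listener you expect to be free after "stop"

for conn in psutil.net_connections(kind='tcp'):
    # laddr is (ip, port); pid can be None if we lack permission to see it
    if conn.status == psutil.CONN_LISTEN and conn.laddr and conn.laddr[1] == PORT:
        if conn.pid is None:
            print('port %d still LISTENing, owner not visible (run as root?)' % PORT)
        else:
            proc = psutil.Process(conn.pid)
            print('port %d held by pid %d: %s (ppid %d)'
                  % (PORT, conn.pid, ' '.join(proc.cmdline()), proc.ppid()))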
I *was* seeing the following traceback in the api log whenever it hit
this state, but it has disappeared from the logs over the last day
or two, though the problem still persists.
2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
2012-07-25 13:07:20 INFO nova.service [-] Child 6519 exited with status 0
2012-07-25 13:07:20 INFO nova.service [-] Child 6518 exited with status 0
2012-07-25 13:07:20 INFO nova.service [-] Child 6517 exited with status 0
2012-07-25 13:07:20 ERROR nova.service [-] Unhandled exception
2012-07-25 13:07:20 TRACE nova.service Traceback (most recent call last):
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 289, in _start_child
2012-07-25 13:07:20 TRACE nova.service self._child_process(wrap.server)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 266, in _child_process
2012-07-25 13:07:20 TRACE nova.service launcher.run_server(server)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 127, in run_server
2012-07-25 13:07:20 TRACE nova.service server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 619, in wait
2012-07-25 13:07:20 TRACE nova.service self.server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/wsgi.py", line 113, in wait
2012-07-25 13:07:20 TRACE nova.service self._server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2012-07-25 13:07:20 TRACE nova.service return self._exit_event.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2012-07-25 13:07:20 TRACE nova.service return hubs.get_hub().switch()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2012-07-25 13:07:20 TRACE nova.service return self.greenlet.switch()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 226, in run
2012-07-25 13:07:20 TRACE nova.service self.wait(sleep_time)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 84, in wait
2012-07-25 13:07:20 TRACE nova.service presult = self.do_poll(seconds)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 245, in _sigterm
2012-07-25 13:07:20 TRACE nova.service LOG.info(_('Received SIGTERM, stopping'))
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1420, in info
2012-07-25 13:07:20 TRACE nova.service self.logger.info(msg, *args, **kwargs)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1140, in info
2012-07-25 13:07:20 TRACE nova.service self._log(INFO, msg, args, **kwargs)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1258, in _log
2012-07-25 13:07:20 TRACE nova.service self.handle(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1268, in handle
2012-07-25 13:07:20 TRACE nova.service self.callHandlers(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1308, in callHandlers
2012-07-25 13:07:20 TRACE nova.service hdlr.handle(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 746, in handle
2012-07-25 13:07:20 TRACE nova.service self.acquire()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 697, in acquire
2012-07-25 13:07:20 TRACE nova.service self.lock.acquire()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/threading.py", line 127, in acquire
2012-07-25 13:07:20 TRACE nova.service rc = self.__block.acquire(blocking)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/semaphore.py", line 71, in acquire
2012-07-25 13:07:20 TRACE nova.service hubs.get_hub().switch()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 162, in switch
2012-07-25 13:07:20 TRACE nova.service assert cur is not self.greenlet, 'Cannot switch to MAINLOOP from MAINLOOP'
2012-07-25 13:07:20 TRACE nova.service AssertionError: Cannot switch to MAINLOOP from MAINLOOP
2012-07-25 13:07:20 TRACE nova.service
2012-07-25 13:07:22 INFO nova.service [-] Parent process has died unexpectedly, exiting
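The bottom of that trace is the telling part: the SIGTERM handler in nova/service.py calls LOG.info(), the logging handler's lock (apparently monkey-patched by eventlet into a green semaphore) has to be acquired, and the acquire tries to switch into the hub greenlet that the signal interrupted, hence "Cannot switch to MAINLOOP from MAINLOOP". Below is a minimal, hypothetical sketch of the unsafe pattern and one common workaround (have the handler only record the signal and do the logging afterwards, from normal context). The names are illustrative and this is not necessarily the fix that actually landed in nova or glance.
# Minimal sketch only -- illustrative names, not nova's actual code.
import logging
import signal

LOG = logging.getLogger(__name__)

# Unsafe: a Python signal handler runs "inside" whatever code the signal
# interrupted.  logging.Handler.handle() takes a lock, so if the interrupted
# code (or, under eventlet, the hub greenlet itself) is entangled with that
# lock, this call can deadlock or hit the MAINLOOP assertion seen above.
def _sigterm_unsafe(signum, frame):
    LOG.info('Received SIGTERM, stopping')
    raise SystemExit()

# Safer: the handler only records the fact; anything non-trivial (logging,
# cleanup) happens later, from ordinary non-signal context.
_got_sigterm = [False]

def _sigterm_safe(signum, frame):
    _got_sigterm[0] = True   # setting a flag is safe inside a signal handler

signal.signal(signal.SIGTERM, _sigterm_safe)

# ... main loop of the service, checking the flag periodically ...
# while not _got_sigterm[0]:
#     do_one_unit_of_work()
# LOG.info('Received SIGTERM, stopping')   # logged outside the handler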
I can't seem to reproduce manually, but I am in the process of adding
debugging capabilities to the systems during provisioning, so I should
be able to get gdb attached to one of the stuck processes tomorrow. Any
clues in the meantime would be appreciated.
I partly want to migrate to using individual API servers instead of
the monolithic nova-api, but if nova-api remains, it should be able to
handle restarts robustly.
To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1029727/+subscriptions