yahoo-eng-team team mailing list archive

[Bug 1364876] Re: Specifying both rpc_workers and api_workers make stoping neutron-server fail

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: mouadino <1364876@xxxxxxxxxxxxxxxxxx>
Date: Thu, 11 Sep 2014 09:57:12 -0000
Reply-to: Bug 1364876 <1364876@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Hi,

Can you be more specific about which values you set for rpc_workers and
api_workers?

FWIW, this bug is a race condition, so you may need to stop the service
more than once before the problem shows up, as I explained in the bug
description. You can also check the logs: whenever you see a warning
message like "pid .. not in child list", you know things are already
going south.

Cheers,

** Also affects: oslo-incubator
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1364876

Title:
Specifying both rpc_workers and api_workers make stoping neutron-
server fail

Status in OpenStack Neutron (virtual network service):
New
Status in The Oslo library incubator:
New

Bug description:
Hi,

When both rpc_workers and api_workers are set to a value greater than
1 and you try to stop the service (e.g. with upstart), the stop doesn't
kill all neutron-server processes, which results in a failure when
starting neutron-server again.

Details:
======

neutron-server creates 2 openstack.common.service.ProcessLauncher
instances, one for the RPC service and the other for the WSGI API
service. However, ProcessLauncher wasn't meant to be instantiated more
than once in a single process, and here is why:

1. Each ProcessLauncher instance registers a callback to catch signals
like SIGTERM, SIGINT and SIGHUP. Having two instances of ProcessLauncher
means signal.signal will be called twice with different callbacks, and
only the last one registered will take effect, i.e. only one
ProcessLauncher instance will catch the signal and do the cleanup.

2. Each ProcessLauncher assumes that it owns all child processes of
the parent process. For example, take a look at the "_wait_child" method,
which reaps any terminated child process, i.e. os.waitpid(0, ...).

3. When only one ProcessLauncher instance handles process termination
while the other doesn't (point 1), this leads to a race condition
between the two:

3.1. Running "stop neutron-server" will also kill the child
processes, but because we have 2 ProcessLaunchers, the one that didn't
catch the kill signal will keep respawning new child processes when
it detects that a child process died. The other won't, because its
self.running was set to False.

3.2. When child processes die (i.e. on "stop neutron-server"), one
of the ProcessLaunchers will catch that with os.waitpid(0, os.WNOHANG)
(both do that). If the death of a child process is caught by the
wrong ProcessLauncher, i.e. not the one that has it in its
self.children instance variable, the parent process will hang forever
in the loop below, because self.children will always contain that child
process:

    if self.children:
        LOG.info(_LI('Waiting on %d children to exit'), len(self.children))
        while self.children:
            self._wait_child()

3.3. When a child process dies and its death is caught by the wrong
ProcessLauncher instance (i.e. not the one that has it in its
self.children), a replacement will never be spawned.
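
To make points 1, 2 and 3.2 easier to reproduce in isolation, here is a minimal sketch in Python. TinyLauncher is a hypothetical stand-in written for this illustration, not the oslo ProcessLauncher code; it only mimics the three behaviours described above (a per-instance signal handler, a self.children set, and os.waitpid(0, os.WNOHANG)). It assumes a POSIX system and that it runs in the main thread.

```python
import os
import signal
import time


class TinyLauncher:
    """Hypothetical stand-in for ProcessLauncher (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.children = set()
        self.running = True
        # Each instance installs its own handler, replacing any earlier one.
        signal.signal(signal.SIGTERM, self._handle_signal)

    def _handle_signal(self, signo, frame):
        self.running = False

    def start_child(self):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # the child exits immediately
        self.children.add(pid)
        return pid

    def wait_child(self):
        # pid 0 means "any child in my process group", not "my children".
        try:
            pid, _status = os.waitpid(0, os.WNOHANG)
        except ChildProcessError:
            return None  # no child left to reap
        if pid == 0:
            return None  # children exist, but none has exited yet
        if pid in self.children:
            self.children.discard(pid)
        else:
            print("%s: pid %d not in child list" % (self.name, pid))
        return pid


def demo():
    old_handler = signal.getsignal(signal.SIGTERM)
    try:
        rpc = TinyLauncher("rpc")
        api = TinyLauncher("api")
        # Point 1: only the handler registered last is active.
        last_wins = signal.getsignal(signal.SIGTERM) == api._handle_signal

        # Points 2 and 3.2: the "api" launcher reaps the "rpc" child.
        pid = rpc.start_child()
        stolen = None
        deadline = time.monotonic() + 5
        while stolen is None and time.monotonic() < deadline:
            stolen = api.wait_child()
            time.sleep(0.01)

        # The child is already reaped, so "rpc" can never remove it.
        rpc.wait_child()
        return last_wins, stolen == pid, pid in rpc.children
    finally:
        signal.signal(signal.SIGTERM, old_handler)


if __name__ == "__main__":
    print(demo())
```

In this sketch, a loop like "while rpc.children: rpc.wait_child()" would never terminate after the child has been reaped by the other launcher, which is the hang described in 3.2.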

Hopefully I made this clear.

Cheers,

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1364876/+subscriptions

References

[Bug 1364876] [NEW] Specifying both rpc_workers and api_workers make stoping neutron-server fail
From: mouadino, 2014-09-03