yahoo-eng-team team mailing list archive

[Bug 1364876] Re: Specifying both rpc_workers and api_workers makes stopping neutron-server fail

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Elena Ezhova <eezhova@xxxxxxxxxxxx>
Date: Mon, 24 Aug 2015 16:36:37 -0000
Reply-to: Bug 1364876 <1364876@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

This problem was fixed while the service code was still part of oslo-incubator and is no longer observed.
Related bug: https://bugs.launchpad.net/neutron/+bug/1432995

** Changed in: oslo.service
Status: New => Invalid

** Changed in: neutron
Status: In Progress => Invalid

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1364876

Title:
Specifying both rpc_workers and api_workers makes stopping
neutron-server fail

Status in neutron:
Invalid
Status in oslo-incubator:
Invalid
Status in oslo.service:
Invalid

Bug description:
Hi,

When both rpc_workers and api_workers are set to a value greater than
1, stopping the service (e.g. with upstart) doesn't kill all
neutron-server processes, which results in a failure when
neutron-server is started again.

Details:
======

neutron-server creates two openstack.common.service.ProcessLauncher
instances, one for the RPC service and one for the WSGI API service.
However, ProcessLauncher wasn't meant to be instantiated more than
once in a single process, and here is why:

1. Each ProcessLauncher instance registers a callback to catch
signals like SIGTERM, SIGINT and SIGHUP. With two instances of
ProcessLauncher, signal.signal is called twice with different
callbacks and only the last call takes effect, i.e. only one
ProcessLauncher instance will catch the signal and do the cleanup.

2. Each ProcessLauncher thinks that it owns all child processes of
the parent process. For example, take a look at the "_wait_child"
method, which catches any killed child process, i.e. os.waitpid(0, ...).

3. Only one ProcessLauncher instance handles the process termination
while the other doesn't (point 1), which leads to race conditions
between the two:

3.1. Running "stop neutron-server" also kills the child processes,
but because there are two ProcessLaunchers, the one that didn't catch
the kill signal keeps respawning new child processes whenever it
detects that a child process died. The other one won't, because its
self.running was set to False.

3.2. When child processes die (e.g. on "stop neutron-server"), one
of the ProcessLaunchers will catch that with os.waitpid(0, os.WNOHANG)
(both do that). If the death of a child process is caught by the
wrong ProcessLauncher, i.e. not the one that has it in its
self.children instance variable, the parent process will hang forever
in the loop below because self.children will always contain that
child process:

    if self.children:
        LOG.info(_LI('Waiting on %d children to exit'), len(self.children))
        while self.children:
            self._wait_child()

3.3. When a child process dies and its death is caught by the wrong
ProcessLauncher instance (i.e. not the one that has it in its
self.children), a replacement will never be spawned.
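
Points 1, 2 and 3.2 can be reproduced in miniature with the sketch
below. ToyLauncher and demo are hypothetical names, not the real
ProcessLauncher code; the sketch only assumes what is described above,
namely that each instance installs its own SIGTERM handler and reaps
children with os.waitpid(0, ...). It uses a blocking waitpid instead of
os.WNOHANG so that the outcome is deterministic, and it needs a POSIX
system and must run in the main thread.

```python
import os
import signal


class ToyLauncher:
    """Hypothetical, heavily simplified stand-in for ProcessLauncher."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.children = set()
        # Point 1: a process has one handler per signal, so this call
        # replaces whatever handler an earlier instance installed.
        signal.signal(signal.SIGTERM, self._handle_signal)

    def _handle_signal(self, signo, frame):
        self.running = False

    def start_child(self):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child process: exit immediately
        self.children.add(pid)
        return pid

    def wait_child(self):
        try:
            # Point 2: pid 0 matches any child in the caller's process
            # group, not just the pids tracked in self.children.
            pid, _status = os.waitpid(0, 0)
        except ChildProcessError:
            return None  # no child left to reap
        self.children.discard(pid)
        return pid


def demo():
    previous = signal.getsignal(signal.SIGTERM)
    try:
        rpc, api = ToyLauncher("rpc"), ToyLauncher("api")
        handler_owner = signal.getsignal(signal.SIGTERM).__self__.name
        pids = {rpc.start_child(), api.start_child()}
        # The rpc launcher reaps both children, including the api one.
        reaped = {rpc.wait_child(), rpc.wait_child()}
        # Point 3.2: api can no longer reap its child, so its
        # children set never becomes empty.
        return handler_owner, reaped == pids, api.wait_child(), len(api.children)
    finally:
        # Restore the handler that was installed before the demo ran.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Under these assumptions demo() reports that the SIGTERM handler
belongs to the instance created last, that the first instance reaped
both children, and that the second instance is left with one pid it
can never remove from its children set, which is the condition that
makes the "while self.children" loop spin forever.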

Hopefully I made this clear.

Cheers,

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1364876/+subscriptions

References

[Bug 1364876] [NEW] Specifying both rpc_workers and api_workers makes stopping neutron-server fail
From: mouadino, 2014-09-03