yahoo-eng-team team mailing list archive

[Bug 1364876] [NEW] Specifying both rpc_workers and api_workers makes stopping neutron-server fail

 

Public bug reported:

Hi,

When both rpc_workers and api_workers are set to a value greater than 1,
stopping the service (e.g. with upstart) doesn't kill all neutron-server
processes, which causes the next start of neutron-server to fail.

Details:
======

neutron-server creates two openstack.common.service.ProcessLauncher
instances, one for each service (i.e. rpc and api). However,
ProcessLauncher wasn't meant to be instantiated more than once in a
single process, and here is why:

1. Each ProcessLauncher instance registers a callback to catch signals
like SIGTERM, SIGINT and SIGHUP. Having two instances of ProcessLauncher
means that signal.signal will be called twice with different callbacks,
and only the last one registered will take effect.

2. Each ProcessLauncher thinks that it owns all child processes of the
current process; for example, take a look at the "_wait_child" method,
which will catch any child process that was killed.

3. When only one ProcessLauncher instance handles process termination
while the other doesn't, this leads to a race condition between the
two:

    3.1. Running "stop neutron-server" will also kill the child
processes, but because we have 2 ProcessLaunchers, the one that didn't
catch the kill signal will keep respawning new child processes when it
detects that they died; the other won't, because its self.running was
set to False.

    3.2. When child processes die (i.e. on stop neutron-server), one of
the ProcessLaunchers will catch that with os.waitpid(0, os.WNOHANG)
(both do that). If the death of a child process is caught by the wrong
ProcessLauncher, i.e. not the one that has it in its children instance
variable, the parent process will hang forever in this loop, because
self.children will always contain that child process:

    if self.children:
        LOG.info(_LI('Waiting on %d children to exit'), len(self.children))
        while self.children:
            self._wait_child()
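
To illustrate 3.2 without forking real processes, here is a small
simulation of the bookkeeping problem. It is only a sketch: FakeKernel
and FakeLauncher are hypothetical stand-ins, not the real
ProcessLauncher code, and the loop is capped with max_polls so the
example terminates instead of hanging.

```python
from collections import deque


class FakeKernel:
    """Hypothetical stand-in for the OS: reports each dead PID exactly once."""

    def __init__(self, dead_pids):
        self._dead = deque(dead_pids)

    def waitpid_any(self):
        # Mimics os.waitpid(0, os.WNOHANG): returns any dead child,
        # no matter which launcher is asking, or 0 if none is left.
        return self._dead.popleft() if self._dead else 0


class FakeLauncher:
    """Hypothetical launcher that only tracks the children it started."""

    def __init__(self, kernel, children):
        self.kernel = kernel
        self.children = set(children)

    def wait_child(self):
        pid = self.kernel.waitpid_any()
        # A PID owned by the other launcher is silently dropped here,
        # so its real owner never learns that the child died.
        self.children.discard(pid)
        return pid


def simulate(max_polls=10):
    kernel = FakeKernel(dead_pids=[101, 201])
    rpc = FakeLauncher(kernel, children=[101])
    api = FakeLauncher(kernel, children=[201])

    rpc.wait_child()  # reaps 101, its own child
    rpc.wait_child()  # reaps 201, which belongs to api

    polls = 0
    # Same shape as "while self.children: self._wait_child()",
    # but bounded so the demonstration returns.
    while api.children and polls < max_polls:
        api.wait_child()
        polls += 1
    return rpc.children, api.children, polls
```

In this sketch the api launcher still lists PID 201 after using up all
its polls, which is the state that makes the real, unbounded loop spin
forever.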

Hopefully I made this clear.
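
Point 1 can also be checked in isolation. The following is a minimal
sketch, assuming a Unix-like system, Python 3.8+ (for
signal.raise_signal) and that it runs in the main thread; the handler
names "rpc" and "api" are only labels for illustration, not the real
ProcessLauncher callbacks.

```python
import signal


def make_handler(name, calls):
    """Build a handler that records which 'launcher' saw the signal."""
    def handler(signum, frame):
        calls.append(name)
    return handler


def install_two_handlers():
    calls = []
    previous = signal.getsignal(signal.SIGTERM)
    try:
        # First "launcher" registers its callback...
        signal.signal(signal.SIGTERM, make_handler("rpc", calls))
        # ...and the second registration replaces it.
        signal.signal(signal.SIGTERM, make_handler("api", calls))
        # Deliver SIGTERM to this process; only one handler can run.
        signal.raise_signal(signal.SIGTERM)
    finally:
        # Restore whatever was installed before (default if unknown).
        signal.signal(
            signal.SIGTERM,
            previous if previous is not None else signal.SIG_DFL,
        )
    return calls
```

Only the handler registered last is recorded; the first one never runs.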

Cheers,

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1364876

Title:
  Specifying both rpc_workers and api_workers makes stopping
  neutron-server fail

Status in OpenStack Neutron (virtual network service):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1364876/+subscriptions

