
yahoo-eng-team team mailing list archive

[Bug 1364876] Re: Specifying both rpc_workers and api_workers makes stopping neutron-server fail

 

Hi,

Can you be more specific about which values you set for the rpc_workers
and api_workers configuration options?

FWIW, this bug is a race condition, so you may have to try more than
once to see the problem happen, as I explained in the bug description.
You can also check the logs: whenever you see a warning message like
"pid .. not in child list", you know that things are already going
south.

Cheers,

** Also affects: oslo-incubator
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1364876

Title:
  Specifying both rpc_workers and api_workers makes stopping
  neutron-server fail

Status in OpenStack Neutron (virtual network service):
  New
Status in The Oslo library incubator:
  New

Bug description:
  Hi,

  When both rpc_workers and api_workers are set to something bigger than
  1 and you try to stop the service with e.g. upstart, the stop doesn't
  kill all neutron-server processes, which results in a failure when
  starting neutron-server again.

  Details:
  ======

  neutron-server creates 2 openstack.common.service.ProcessLauncher
  instances, one for the RPC service and the other for the WSGI API
  service. However, ProcessLauncher wasn't meant to be instantiated more
  than once in a single process, and here is why:

  1. Each ProcessLauncher instance registers a callback to catch signals
  like SIGTERM, SIGINT and SIGHUP. Having two instances of
  ProcessLauncher means signal.signal will be called twice with
  different callbacks, and only the last one registered takes effect,
  i.e. only one ProcessLauncher instance will catch the signal and do
  the cleanup.

  2. Each ProcessLauncher assumes that it owns all child processes of
  the parent process. For example, take a look at the "_wait_child"
  method, which catches any killed child process, i.e.
  os.waitpid(0, ... .

  3. Because only one ProcessLauncher instance handles the process
  termination while the other doesn't (point 1), this leads to a race
  condition between the two:

      3.1. Running "stop neutron-server" also kills the child
  processes, but because we have 2 ProcessLaunchers, the one that didn't
  catch the kill signal will keep respawning new child processes when
  it detects that a child process died. The other one won't, because
  its self.running was set to False.

      3.2. When child processes die (i.e. on "stop neutron-server"), one
  of the ProcessLaunchers will catch that with os.waitpid(0, os.WNOHANG)
  (both do that). If the death of a child process is caught by the
  wrong ProcessLauncher, i.e. not the one that has it in its
  self.children instance variable, the parent process will hang forever
  in the loop below, because self.children will always contain that
  child process:

      if self.children:
          LOG.info(_LI('Waiting on %d children to exit'), len(self.children))
          while self.children:
              self._wait_child()

      3.3. When a child process dies and its death is caught by the wrong
  ProcessLauncher instance (i.e. not the one that has it in its
  self.children), a replacement will never be spawned.
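
  To illustrate points 1 to 3, here is a minimal Python sketch. ToyLauncher
  and demo are hypothetical names made up for this example, not the actual
  oslo-incubator code. The sketch only mimics the two behaviours described
  above: signal.signal being overwritten by the second instance, and
  os.waitpid(0, ...) reaping a child that the other instance tracks. It
  assumes a POSIX system and that it runs in the main thread.

```python
import os
import signal


class ToyLauncher:
    """Hypothetical, stripped-down stand-in for a ProcessLauncher."""

    def __init__(self, name):
        self.name = name
        self.children = set()
        self.running = True
        # Every instance installs its own handler; the most recent
        # signal.signal() call replaces the previous one.
        signal.signal(signal.SIGTERM, self._handle_signal)

    def _handle_signal(self, signo, frame):
        self.running = False

    def start_child(self):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        self.children.add(pid)
        return pid

    def wait_child(self):
        # pid 0 means "any child in our process group", so this can
        # reap a child that belongs to another launcher.
        pid, _status = os.waitpid(0, 0)
        if pid in self.children:
            self.children.remove(pid)
            return pid, True
        return pid, False  # the "pid .. not in child list" case


def demo():
    old_handler = signal.getsignal(signal.SIGTERM)
    try:
        rpc = ToyLauncher("rpc")
        api = ToyLauncher("api")  # overwrites rpc's SIGTERM handler

        # Point 1: only the last registered handler sees the signal.
        os.kill(os.getpid(), signal.SIGTERM)

        # Points 2 and 3: rpc starts a child, but api reaps it.
        child = rpc.start_child()
        _reaped, owned = api.wait_child()

        return {
            "rpc_running": rpc.running,
            "api_running": api.running,
            "reaped_by_owner": owned,
            "rpc_still_tracks_child": child in rpc.children,
        }
    finally:
        # Restore whatever handler was installed before the demo.
        if old_handler is None:
            old_handler = signal.SIG_DFL
        signal.signal(signal.SIGTERM, old_handler)


if __name__ == "__main__":
    print(demo())
```

  In this sketch only the "api" instance notices SIGTERM, and the "rpc"
  instance keeps a pid in its children set that has already been reaped,
  so a loop like "while self.children: self._wait_child()" on that
  instance would never finish.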

  Hopefully I made this clear.

  Cheers,

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1364876/+subscriptions

