yahoo-eng-team team mailing list archive

[Bug 1364876] Re: Specifying both rpc_workers and api_workers makes stopping neutron-server fail

 

** Also affects: oslo.service
   Importance: Undecided
       Status: New

** Changed in: neutron
     Assignee: Li Ma (nick-ma-z) => (unassigned)

** Changed in: oslo.service
     Assignee: (unassigned) => Li Ma (nick-ma-z)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1364876

Title:
  Specifying both rpc_workers and api_workers makes stopping
  neutron-server fail

Status in neutron:
  In Progress
Status in oslo-incubator:
  Invalid
Status in oslo.service:
  New

Bug description:
  Hi,

  When both rpc_workers and api_workers are set to a value greater than
  1, stopping the service (e.g. with upstart) does not kill all
  neutron-server processes, which results in a failure when starting
  neutron-server again.

  Details:
  ======

  neutron-server creates two openstack.common.service.ProcessLauncher
  instances, one for the RPC service and the other for the WSGI API
  service. However, ProcessLauncher was not meant to be instantiated
  more than once in a single process, and here is why:

  1. Each ProcessLauncher instance registers a callback to catch signals
  like SIGTERM, SIGINT and SIGHUP. Having two instances of
  ProcessLauncher means signal.signal will be called twice with
  different callbacks, and only the last one registered will take
  effect, i.e. only one ProcessLauncher instance will catch the signal
  and do the cleanup.

  2. Each ProcessLauncher assumes that it owns all child processes of
  the parent process. For example, take a look at the "_wait_child"
  method, which reaps any exited child process, i.e. os.waitpid(0, ...).

  3. When only one ProcessLauncher instance handles the process
  termination while the other doesn't (point 1), this leads to a race
  condition between the two:

      3.1. Running "stop neutron-server" also kills the child
  processes, but because we have two ProcessLauncher instances, the one
  that didn't catch the kill signal will keep respawning new child
  processes whenever it detects that a child process died. The other
  won't, because its self.running was set to False.

      3.2. When child processes die (i.e. on stop neutron-server), one
  of the ProcessLauncher instances will catch that with
  os.waitpid(0, os.WNOHANG) (both do that). If the death of a child
  process is caught by the wrong ProcessLauncher, i.e. not the one that
  has it in its self.children instance variable, the parent process
  will hang forever in the loop below because self.children will always
  contain that child process:

      if self.children:
          LOG.info(_LI('Waiting on %d children to exit'), len(self.children))
          while self.children:
              self._wait_child()

      3.3. When a child process dies and its death is caught by the
  wrong ProcessLauncher instance (i.e. not the one that has it in its
  self.children), a replacement will never be spawned.
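
  The three points above can be modelled in a few lines of Python. This
  is only a sketch under stated assumptions, not the oslo code: the
  Launcher class, its wait_child method, the pids and the dead_pids list
  (a stand-in for what os.waitpid(0, os.WNOHANG) would hand back) are
  all hypothetical, and SIGUSR1 is used instead of SIGTERM so that
  running the script is harmless. It needs a Unix-like system and must
  run in the main thread.

```python
import signal


class Launcher:
    """Hypothetical, heavily simplified stand-in for ProcessLauncher."""

    def __init__(self, name, children):
        self.name = name
        self.running = True
        self.children = set(children)  # pids this launcher believes it owns
        # Point 1: every instance installs its own handler, and a later
        # signal.signal() call replaces the handler installed earlier.
        signal.signal(signal.SIGUSR1, self._handle_signal)

    def _handle_signal(self, signum, frame):
        self.running = False

    def wait_child(self, dead_pids):
        """Model of os.waitpid(0, os.WNOHANG): take ANY dead child."""
        if not dead_pids:
            return None
        pid = dead_pids.pop(0)
        # Points 2 and 3.2: the pid may belong to the other launcher. It is
        # consumed here anyway, so its real owner never sees it.
        self.children.discard(pid)
        return pid


previous = signal.getsignal(signal.SIGUSR1)
rpc = Launcher("rpc", children=[101, 102])
api = Launcher("api", children=[201, 202])
installed = signal.getsignal(signal.SIGUSR1)
# Restore whatever was installed before this demo touched the signal.
signal.signal(signal.SIGUSR1,
              previous if previous is not None else signal.SIG_DFL)

# Only the launcher created last would be notified of the signal.
handler_owner = installed.__self__.name

# All four children die; each launcher happens to reap one pid that
# belongs to the other one.
dead_pids = [101, 201, 102, 202]
api.wait_child(dead_pids)  # consumes 101, which rpc owns
api.wait_child(dead_pids)  # consumes 201
rpc.wait_child(dead_pids)  # consumes 102
rpc.wait_child(dead_pids)  # consumes 202, which api owns

# Nothing is left to reap, yet both launchers still list a child, so a
# "while self.children: self._wait_child()" loop would never finish.
leftover = {"rpc": sorted(rpc.children), "api": sorted(api.children)}
print(handler_owner, leftover)
```

  In this model handler_owner ends up as "api", and each launcher is
  left with one pid in its children set that it can never reap, which
  is the situation described in 3.2.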

  Hopefully I made this clear.

  Cheers,

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1364876/+subscriptions

