← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1780139] Re: Sending SIGHUP to neutron-server process causes it to hang

 

** Also affects: tripleo
   Importance: Undecided
       Status: New

** Changed in: tripleo
       Status: New => Triaged

** Changed in: tripleo
   Importance: Undecided => Critical

** Changed in: tripleo
    Milestone: None => rocky-3

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1780139

Title:
  Sending SIGHUP to neutron-server process causes it to hang

Status in neutron:
  Triaged
Status in tripleo:
  Triaged

Bug description:
  * High level description

  When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist:
  wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)

  This is problematic, because logrotate uses SIGHUP in the
  containerized environment. It doesn't happen always: it might take one
  or two signals reasonably interspersed, to trigger this.

  * Pre-conditions:

  I'm using CentOS 7 + Queens RDO
  "rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280"

  I first noticed the issue when the neutron_api docker would stop
  working on the control nodes, eventually it was traced back to the
  logrotate_crond container sending SIGHUP to all the processes owning
  log files in /var/log/containers. This doesn't happen every time, but
  it's pretty easy to trigger on my system.

  * Step-by-step reproduction steps:

  # Start with a clean container

  docker restart neutron_api

  # Identify the neutron-server PID: (613782 in this case) and send
  SIGHUP

  kill -HUP 613782

  # the relevant log files generally look clean the first time:

  2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
  2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
  2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True
  2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0
  2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list
  2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0
  2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0
  2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0
  2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696
  2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696
  2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0
  2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0
  2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696
  2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0
  2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696
  2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696
  2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0
  2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0
  2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696
  2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list
  2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696

  # But on the second SIGHUP, the following happened:

  2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
  2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up.
  2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True
  2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit
  2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0
  2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0
  2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15
  2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0
  2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0
  2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0
  2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0
  2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0
  2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0
  2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0
  2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0

  * Version

  I'm running Centos7 RDO TripleO (commit tag c9fd24040454913b4a325741094285676fb7e7bc_a0a28280), the OVS ML2 plugin and 3 controllers in a HA setup. tripleo templates and deployment config
  available on request.

  This corresponds to OpenStack Queens.

  I have not fully investigated all other services, but they seem to be
  fine with SIGHUP. In particular nova-api-metadata seems to be OK and
  doing what's expected.

  * Perceived severity

  This blocks our private production stack from running: neutron api
  breaking has a big knock-on effect, unfortunately.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1780139/+subscriptions


References