yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #81645
[Bug 1780139] Re: Sending SIGHUP to neutron-server process causes it to hang
Reviewed: https://review.opendev.org/706514
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=13aa00026f768642f385259ba7ccddcfdad3c75f
Submitter: Zuul
Branch: master
commit 13aa00026f768642f385259ba7ccddcfdad3c75f
Author: Bernard Cafarelli <bcafarel@xxxxxxxxxx>
Date: Fri Feb 7 14:21:09 2020 +0100
Re-use existing ProcessLauncher from wsgi in RPC workers
If both are run under the same process, and api_workers >= 2, the server
process will instantiate two oslo_service.ProcessLauncher instances
This should be avoided [0], and indeed causes issues on subprocess and
signal handling: killed RPC workers not respawning, SIGHUP on master
process leading to unresponsive server, signal not properly sent to all
child processes, ...
To avoid this, use the wsgi ProcessLauncher instance if it exists
[0] https://docs.openstack.org/oslo.service/latest/user/usage.html#launchers
Change-Id: Ic821f8ca84add9c8137ef712031afb43e491591c
Closes-Bug: #1780139
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1780139
Title:
Sending SIGHUP to neutron-server process causes it to hang
Status in neutron:
Fix Released
Status in tripleo:
Won't Fix
Bug description:
* High level description
When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist:
wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)
This is problematic, because logrotate uses SIGHUP in the
containerized environment. It doesn't happen always: it might take one
or two signals reasonably interspersed, to trigger this.
* Pre-conditions:
I'm using CentOS 7 + Queens RDO
"rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280"
I first noticed the issue when the neutron_api docker would stop
working on the control nodes, eventually it was traced back to the
logrotate_crond container sending SIGHUP to all the processes owning
log files in /var/log/containers. This doesn't happen every time, but
it's pretty easy to trigger on my system.
* Step-by-step reproduction steps:
# Start with a clean container
docker restart neutron_api
# Identify the neutron-server PID: (613782 in this case) and send
SIGHUP
kill -HUP 613782
# the relevant log files generally look clean the first time:
2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0
2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list
2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0
2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0
2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0
2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0
2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0
2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0
2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0
2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0
2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list
2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696
# But on the second SIGHUP, the following happened:
2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up.
2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True
2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit
2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0
2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0
2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0
* Version
I'm running Centos7 RDO TripleO (commit tag c9fd24040454913b4a325741094285676fb7e7bc_a0a28280), the OVS ML2 plugin and 3 controllers in a HA setup. tripleo templates and deployment config
available on request.
This corresponds to OpenStack Queens.
I have not fully investigated all other services, but they seem to be
fine with SIGHUP. In particular nova-api-metadata seems to be OK and
doing what's expected.
* Perceived severity
This blocks our private production stack from running: neutron api
breaking has a big knock-on effect, unfortunately.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1780139/+subscriptions
References