[Bug 1780139] Re: Sending SIGHUP to neutron-server process causes it to hang
** Also affects: tripleo
Importance: Undecided
Status: New
** Changed in: tripleo
Status: New => Triaged
** Changed in: tripleo
Importance: Undecided => Critical
** Changed in: tripleo
Milestone: None => rocky-3
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1780139
Title:
Sending SIGHUP to neutron-server process causes it to hang
Status in neutron:
Triaged
Status in tripleo:
Triaged
Bug description:
* High level description
When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist:
wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)
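The loop is easy to observe by attaching strace to the main process on the host, for example (using the host-side PID, 613782 in the reproduction below):
strace -p 613782 -e trace=wait4
# repeats the wait4(..., WNOHANG) = -1 ECHILD line above continuously, with no pause in between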
This is problematic because logrotate uses SIGHUP in the
containerized environment. It doesn't always happen: it can take one
or two signals, reasonably spaced apart, to trigger it.
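For context, the effect of the containerized logrotate on rotation is roughly the following (a sketch only, not the exact postrotate script used in the deployment):
# HUP every process that still holds an open log file under /var/log/containers
kill -HUP $(lsof -t +D /var/log/containers)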
* Pre-conditions:
I'm using CentOS 7 + Queens RDO
"rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280"
I first noticed the issue when the neutron_api container would stop
working on the control nodes; it was eventually traced back to the
logrotate_crond container sending SIGHUP to all the processes owning
log files in /var/log/containers. This doesn't happen every time, but
it's pretty easy to trigger on my system.
* Step-by-step reproduction:
# Start with a clean container
docker restart neutron_api
# Identify the neutron-server PID (613782 in this case) and send SIGHUP
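# One way to find it on the host (the parent is the oldest matching process):
pgrep -of neutron-server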
kill -HUP 613782
# the relevant log files generally look clean the first time:
2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0
2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list
2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0
2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0
2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0
2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0
2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0
2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0
2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0
2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0
2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list
2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696
# But on the second SIGHUP, the following happened:
2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up.
2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True
2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit
2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0
2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0
2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0
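At this point no new wsgi workers come up and the main process is stuck in the tight wait4 loop described above. A couple of illustrative checks (host PID and bind address taken from this report):
curl -m 5 http://10.0.105.101:9696/
# times out or is refused, since no wsgi workers are accepting any more
ps -o pid,pcpu,stat,cmd -p 613782
# the parent is still alive and burning CPU in the wait4 loop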
* Version
I'm running CentOS 7 RDO TripleO (commit tag c9fd24040454913b4a325741094285676fb7e7bc_a0a28280) with the OVS ML2 plugin and 3 controllers in an HA setup. TripleO templates and deployment config are available on request.
This corresponds to OpenStack Queens.
I have not fully investigated all other services, but they seem to be
fine with SIGHUP. In particular, nova-api-metadata seems to be OK and
doing what's expected.
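For comparison, a check along these lines (illustrative; pgrep -o picks the oldest, i.e. parent, process) leaves that service working:
kill -HUP $(pgrep -of nova-api-metadata)
# nova-api-metadata keeps serving and its workers come back as expected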
* Perceived severity
This blocks our private production stack from running: the Neutron API
breaking has a big knock-on effect, unfortunately.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1780139/+subscriptions