yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #80517
[Bug 1849502] Re: [DHCP] Check the dnsmasq process status after enabling the process
Reviewed: https://review.opendev.org/690700
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee
Submitter: Zuul
Branch: master
commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date: Wed Oct 23 14:47:54 2019 +0000
Check dnsmasq process is active when spawned
After spawning the "dnsmasq" process in the method
"Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
process is running and could be detected by the ProcessManager instance
controlling it.
ProcessManager determines if a process is "active":
- If the network ID is in the cmdline used to execute the process.
- If the process is detected by psutil.Process(pid), returning the
cmdline needed in the first condition.
- If the PID file exists; this is written by the dnsmasq process
once is started and is needed in the second condition.
To make this feature available for any other process using
ProcessManager, the implementation is done in this class.
Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
Closes-Bug: #1849502
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1849502
Title:
[DHCP] Check the dnsmasq process status after enabling the process
Status in neutron:
Fix Released
Bug description:
Hello:
Version: Train (but also reproducible in master)
Environment: 1 network with 3 subnets, and one host for the DHCP agent.
We found a race condition between the DHCP agent driver call and the
dnsmasq process status.
When a deployment process starts, a network is created with three subnets. The subnet creation timestamp is:
```
server.log.3:1828:2019-10-21 22:29:44.648 24 DEBUG neutron_lib.callbacks.manager [req-91242e64-648f-45e8-a5eb-de18c9301ea0 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
server.log.3:1875:2019-10-21 22:29:45.929 24 DEBUG neutron_lib.callbacks.manager [req-27ed89df-cbb0-4a3e-90db-208401560889 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
server.log.3:1974:2019-10-21 22:29:47.846 24 DEBUG neutron_lib.callbacks.manager [req-113792ac-9649-443e-9ee3-102ea02f5bd5 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
```
As we can see in [1], the DHCP is informed about the different changes in the network and reacts depending on the current status. The first time, the DHCP process is enabled (no other process is running at this point). This happens at 22:29:45.340, between first subnet and second subnet creation. The network information at this point has only one subnet, retrieved by the DHCP agent in:
```
2019-10-21 22:29:45.152 27 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-535e0231-b69e-4136-8ce4-aad14c6fc9ac - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
```
The next driver call is a "restart". This happens when 2nd and 3rd subnets are created. When the DHCP agent calls the server to retrieve the network information, this network has all three subnets:
```
2019-10-21 22:29:49.135 26 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
```
What is happening in the "restart" process is a bit weird [2]. The process should be stopped first and then restarted with the new cmd line (including the new subnet dhcp-ranges --> the main problem detected in this bug). But the dnsmasq process is not running yet:
```
dhcp-agent.log.2:442:2019-10-21 22:29:49.417 57623 DEBUG neutron.agent.linux.external_process [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] No dnsmasq process started for 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:121
```
The DHCP agent started the dnsmasq process in:
```
2019-10-21 22:29:48.496 57623 DEBUG neutron.agent.linux.utils [req-9de39ee9-2b4f-44c1-be47-46922943dd17 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/host', '--addn-hosts=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-dynamic', '--dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1500', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
```
That means one second before the DHCP agent tried to stop it again.
To avoid this, when the DHCP agent starts the dnsmasq process, we need
to ensure that ProcessManager() check after enabling the process that
this process is active. If we ensure this, in the next agent call (and
driver call - dnsmasq), the process for a specific network is up and
running, and ProcessManager can detect the process is active.
Regards.
[1] http://paste.openstack.org/show/785519/
[2] http://paste.openstack.org/show/785520/
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1849502/+subscriptions
References