
yahoo-eng-team team mailing list archive

[Bug 1849502] Re: [DHCP] Check the dnsmasq process status after enabling the process


Reviewed:  https://review.opendev.org/690700
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee
Submitter: Zuul
Branch:    master

commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date:   Wed Oct 23 14:47:54 2019 +0000

    Check dnsmasq process is active when spawned
    
    After spawning the "dnsmasq" process in the method
    "Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
    process is running and could be detected by the ProcessManager instance
    controlling it.
    
    ProcessManager considers a process "active" if:
    - The network ID is in the cmdline used to execute the process.
    - The process is detected by psutil.Process(pid), which returns the
      cmdline needed for the first condition.
    - The PID file exists; this is written by the dnsmasq process once it
      is started and is needed for the second condition.
    
    To make this feature available for any other process using
    ProcessManager, the implementation is done in this class.
    
    Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
    Closes-Bug: #1849502
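The three "active" conditions listed in the commit message can be sketched as a small check. This is an illustrative simplification, not Neutron's actual code: the real ProcessManager reads the command line via psutil.Process(pid).cmdline(), while this sketch reads /proc directly (Linux only), and the function name is hypothetical:

```python
import os


def is_active(pid_file, network_id):
    """Illustrative sketch of ProcessManager's "active" test.

    Neutron uses psutil; this simplified version reads /proc directly.
    """
    try:
        # The PID file must exist: dnsmasq writes it once it has started.
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False
    try:
        # The process with that PID must still be running; read back its
        # command line (psutil.Process(pid).cmdline() in Neutron).
        with open("/proc/%d/cmdline" % pid, "rb") as f:
            cmdline = f.read().replace(b"\0", b" ").decode()
    except OSError:
        return False
    # The network ID must appear in the command line used to spawn it.
    return network_id in cmdline
```

All three conditions must hold; if any of them fails, the process is treated as not running.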


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1849502

Title:
  [DHCP] Check the dnsmasq process status after enabling the process

Status in neutron:
  Fix Released

Bug description:
  Hello:

  Version: Train (but also reproducible in master)
  Environment: 1 network with 3 subnets, and one host for the DHCP agent.

  We found a race condition between the DHCP agent driver call and the
  dnsmasq process status.

  When a deployment process starts, a network is created with three subnets. The subnet creation timestamps are:
  ```
  server.log.3:1828:2019-10-21 22:29:44.648 24 DEBUG neutron_lib.callbacks.manager [req-91242e64-648f-45e8-a5eb-de18c9301ea0 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
  server.log.3:1875:2019-10-21 22:29:45.929 24 DEBUG neutron_lib.callbacks.manager [req-27ed89df-cbb0-4a3e-90db-208401560889 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
  server.log.3:1974:2019-10-21 22:29:47.846 24 DEBUG neutron_lib.callbacks.manager [req-113792ac-9649-443e-9ee3-102ea02f5bd5 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
  ```

  As we can see in [1], the DHCP agent is informed about the different changes in the network and reacts depending on the current status. The first time, the dnsmasq process is enabled (no other process is running at this point). This happens at 22:29:45.340, between the first and second subnet creations. At this point the network information contains only one subnet, retrieved by the DHCP agent in:
  ```
  2019-10-21 22:29:45.152 27 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-535e0231-b69e-4136-8ce4-aad14c6fc9ac - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
  ```

  The next driver call is a "restart". This happens when the second and third subnets are created. When the DHCP agent calls the server to retrieve the network information, the network has all three subnets:
  ```
  2019-10-21 22:29:49.135 26 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
  ```

  What happens during the "restart" is a bit odd [2]. The process should be stopped first and then restarted with the new command line (including the new subnet dhcp-ranges, which is the main problem detected in this bug). But the dnsmasq process is not seen as running yet:
  ```
  dhcp-agent.log.2:442:2019-10-21 22:29:49.417 57623 DEBUG neutron.agent.linux.external_process [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] No dnsmasq process started for 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:121
  ```
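Why "No dnsmasq process started" appears at this point can be sketched as follows. This is a hypothetical simplification of the restart's stop step (modeled loosely on ProcessManager.disable; the names and signature are illustrative): if the process is not yet seen as active, nothing is stopped and the message above is logged instead:

```python
def disable(is_active, kill, log):
    """Hypothetical sketch of the restart's "stop" step."""
    if not is_active():
        # The race in this bug: dnsmasq was just spawned but is not yet
        # detected as active, so the restart skips the stop...
        log("No dnsmasq process started for <network-id>")
        return False
    kill()  # ...instead of stopping the old process here.
    return True
```

The old process then survives with the stale command line, which is why the new dhcp-ranges never take effect.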

  The DHCP agent started the dnsmasq process in:
  ```
  2019-10-21 22:29:48.496 57623 DEBUG neutron.agent.linux.utils [req-9de39ee9-2b4f-44c1-be47-46922943dd17 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/host', '--addn-hosts=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-dynamic', '--dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1500', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
  ```

  That is, the dnsmasq process was started only about one second before
  the DHCP agent tried to stop it again.

  To avoid this, when the DHCP agent starts the dnsmasq process,
  ProcessManager must check, after enabling the process, that the process
  is active. If we ensure this, on the next agent call (and driver call
  to dnsmasq), the process for a given network is up and running, and
  ProcessManager can detect that it is active.
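The proposed fix can be sketched as a spawn-then-poll loop. Names, signatures, and the timeout values here are illustrative, not Neutron's actual API; the point is that enabling only returns once the "active" check passes:

```python
import time


def enable_and_wait(spawn, is_active, timeout=10.0, interval=0.5):
    """Illustrative sketch of the fix: spawn the process, then poll the
    ProcessManager-style "active" check until it passes, so the next
    "restart" finds a live process to stop.
    """
    spawn()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_active():
            return True
        time.sleep(interval)
    raise RuntimeError("process did not become active within %.1f s" % timeout)
```

With this in place, a restart arriving right after enable cannot observe the "No dnsmasq process started" state, because enable does not complete until the process is detectable.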

  Regards.

  [1] http://paste.openstack.org/show/785519/
  [2] http://paste.openstack.org/show/785520/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1849502/+subscriptions

