yahoo-eng-team team mailing list archive

[Bug 1849676] Re: DHCP agents time out during startup at 60s when there is enough agents

 

Reviewed:  https://review.opendev.org/692160
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aedc09917680bc77501f4b57142e73768e3d06ad
Submitter: Zuul
Branch:    master

commit aedc09917680bc77501f4b57142e73768e3d06ad
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date:   Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement
    
    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".
    
    This patch increases this timeout to a more reasonable 300 seconds.
    
    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757
    
    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676


** Changed in: neutron
       Status: In Progress => Fix Released

** Bug watch added: Python Roundup #35757
   http://bugs.python.org/issue35757

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1849676

Title:
  DHCP agents time out during startup at 60s when there is enough agents

Status in neutron:
  Fix Released

Bug description:
  
  The following introduces a 60s timeout to DHCP agent startups:

  https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a#diff-3fcbcfeebb7de79a1cb36faed9b8b091

  The value is not adjustable from conf.

  
  When there are enough network elements (i.e. ~1200 DHCP-enabled subnets in our case), nearly all DHCP startups fail with:

  2019-10-09 13:21:27.826 694156 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network 8b4b5496-8b35-482e-a2a3-7c352f1e343a: WaitTimeout: Timed out after 60 seconds


  The timeout happens because the operations run in sequence, with
  100-300 ms between each operation, and too many agents are attempted at
  the same time:

  2019-10-09 12:24:17.836 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-13648243-5659-4094-9dd7-cee58e4d46ac', 'ip', '-o', 'link', 'show', 'tap3af1ee05-2f'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
  2019-10-09 12:24:18.075 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1407a320-8fca-4dfd-a011-96a2ad41779f', 'ip', '-o', 'link', 'show', 'tap21c443e1-89'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
  2019-10-09 12:24:18.266 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1412976d-4cd0-452a-91b7-7f8c3003c722', 'ip', '-o', 'link', 'show', 'tap1cb0995e-bd'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
  2019-10-09 12:24:18.541 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1433d8a6-fa06-4544-994a-d38b01302490', 'ip', '-o', 'link', 'show', 'tap00fcd6f1-b1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
  2019-10-09 12:24:18.735 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1447a8cf-e94c-4250-b9a8-2b13c0cf60c6', 'ip', '-o', 'link', 'show', 'tap91076dbc-83'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
  2019-10-09 12:24:18.930 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-14b396e1-d561-4087-990d-9b993cc08619', 'ip', '-o', 'link', 'show', 'tapdf767a28-af'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
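
  As a rough back-of-envelope check of the figures above, the sketch below
  multiplies the reported subnet count by the reported gap between
  operations. It assumes one such operation per network and a strictly
  sequential pass, which is a simplification and not a measurement; the
  function and constant names are illustrative only.

```python
# Back-of-envelope estimate; the input figures come from the report above.
SUBNETS = 1200        # ~1200 DHCP-enabled subnets
GAP_MS = (100, 300)   # 100-300 ms between consecutive operations
OLD_TIMEOUT = 60      # seconds, the timeout seen in the error message


def sequential_pass_seconds(count, gap_ms):
    """Time for one strictly sequential pass over `count` networks.

    Assumption: exactly one operation per network, no parallelism.
    """
    return count * gap_ms / 1000


low, high = (sequential_pass_seconds(SUBNETS, gap) for gap in GAP_MS)
print(f"one pass: {low:.0f}-{high:.0f} s, timeout: {OLD_TIMEOUT} s")
```

  Under these assumptions even the lower estimate is well above the 60 s
  timeout, which is consistent with nearly all startups failing.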


  The following allows the agents to start:

  - common_utils.wait_until_true(self._enable)
  + common_utils.wait_until_true(self._enable, timeout=300)
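
  To illustrate why the timeout argument matters, here is a minimal
  sketch of a polling helper in the same spirit. It is a simplified
  stand-in written with the standard library only, not neutron's
  implementation; the names poll_until_true and PollTimeout are
  hypothetical.

```python
import time


class PollTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def poll_until_true(predicate, timeout=60, sleep=1):
    """Call `predicate` repeatedly until it returns a truthy value.

    Raises PollTimeout once `timeout` seconds have elapsed without success.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise PollTimeout(f"Timed out after {timeout} seconds")
        time.sleep(sleep)


# Usage: a predicate that only succeeds on its third check.
attempts = {"count": 0}


def enabled_after_three_checks():
    attempts["count"] += 1
    return attempts["count"] >= 3


poll_until_true(enabled_after_three_checks, timeout=1, sleep=0.01)
```

  With a helper like this, a slow predicate fails as soon as the deadline
  passes, so raising the timeout gives slow startups more room without
  changing anything else.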

  
  A few ways to solve this issue:
  - Increase the default timeout from 60s to a larger value
  - Make the timeout adjustable from the DHCP agent configuration
  - Figure out more optimal batch sizes of DHCP agents to be started at a time, and improve the startup performance
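
  The second option could look roughly like the sketch below, which reads
  the timeout from an INI-style configuration and falls back to 60 s. The
  option name dnsmasq_enable_timeout and the helper load_enable_timeout
  are hypothetical and do not exist in neutron; the sketch uses the
  standard-library configparser rather than neutron's own configuration
  machinery.

```python
import configparser

DEFAULT_ENABLE_TIMEOUT = 60  # seconds, the hard-coded value from the report


def load_enable_timeout(conf_text):
    """Return the timeout from INI text, or the default when it is unset.

    `dnsmasq_enable_timeout` is a hypothetical option name used only here.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint(
        "DEFAULT", "dnsmasq_enable_timeout", fallback=DEFAULT_ENABLE_TIMEOUT
    )


# Usage: an operator with many subnets raises the limit in the conf file.
tuned = load_enable_timeout("[DEFAULT]\ndnsmasq_enable_timeout = 300\n")
untuned = load_enable_timeout("[DEFAULT]\n")
print(tuned, untuned)
```

  The returned value would then be passed as the timeout argument at the
  call site shown earlier in this report.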

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1849676/+subscriptions

