yahoo-eng-team team mailing list archive

[Bug 1849676] [NEW] DHCP agents time out during startup at 60s when there is enough agents

 

Public bug reported:


The following commit introduces a 60-second timeout to DHCP agent startup:

https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a#diff-3fcbcfeebb7de79a1cb36faed9b8b091

The value cannot be adjusted through configuration.


When there are enough network elements (i.e. ~1200 DHCP-enabled subnets in our case), nearly all DHCP startups fail with:

2019-10-09 13:21:27.826 694156 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network 8b4b5496-8b35-482e-a2a3-7c352f1e343a: WaitTimeout: Timed out after 60 seconds


The timeout occurs because the operations run in sequence, with 100-300 ms
between each operation, and too many agents are started at the same time:

2019-10-09 12:24:17.836 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-13648243-5659-4094-9dd7-cee58e4d46ac', 'ip', '-o', 'link', 'show', 'tap3af1ee05-2f'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.075 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1407a320-8fca-4dfd-a011-96a2ad41779f', 'ip', '-o', 'link', 'show', 'tap21c443e1-89'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.266 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1412976d-4cd0-452a-91b7-7f8c3003c722', 'ip', '-o', 'link', 'show', 'tap1cb0995e-bd'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.541 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1433d8a6-fa06-4544-994a-d38b01302490', 'ip', '-o', 'link', 'show', 'tap00fcd6f1-b1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.735 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1447a8cf-e94c-4250-b9a8-2b13c0cf60c6', 'ip', '-o', 'link', 'show', 'tap91076dbc-83'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.930 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-14b396e1-d561-4087-990d-9b993cc08619', 'ip', '-o', 'link', 'show', 'tapdf767a28-af'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103


The following allows the agents to start:

- common_utils.wait_until_true(self._enable)
+ common_utils.wait_until_true(self._enable, timeout=300)
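
A rough estimate shows why 60 seconds is too short at this scale. The sketch below assumes one serialized operation per DHCP-enabled subnet, which simplifies what the agent really does; the subnet count and the 100-300 ms gap come from this report, and the function name is made up for illustration:

```python
# Back-of-the-envelope estimate only. Assumes one serialized operation per
# DHCP-enabled subnet, which is a simplification of the real agent.
def sequential_pass_seconds(operations, ms_per_operation):
    """Time for `operations` commands executed strictly one after another."""
    return operations * ms_per_operation / 1000.0

SUBNETS = 1200             # approximate count from the report
GAP_RANGE_MS = (100, 300)  # gap between operations seen in the log excerpt

low, high = (sequential_pass_seconds(SUBNETS, gap) for gap in GAP_RANGE_MS)
print("one sequential pass takes roughly %.0f-%.0f s" % (low, high))

for timeout in (60, 300):
    print("timeout=%d s covers the fastest pass: %s" % (timeout, timeout >= low))
```

Under these assumptions even the fastest pass takes about twice the 60-second limit, while 300 seconds covers most of the estimated range.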


A few ways to solve this issue:
- Increase the default timeout from 60s to a larger value
- Make the timeout adjustable in the DHCP agent configuration
- Work out better batch sizes for the number of DHCP agents started at a time, and improve startup performance
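
The second option could look roughly like the sketch below. It is not neutron code: wait_until_true here is a standard-library stand-in for the helper named in the diff above, and the option name dhcp_enable_timeout and the dict-based configuration are hypothetical.

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Stand-in for the neutron helper of the same name, not its actual code.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)


# Hypothetical option name; the report only asks that such a setting exist.
DEFAULT_CONF = {"dhcp_enable_timeout": 60}


def enable_with_configured_timeout(enable, conf=DEFAULT_CONF, sleep=1):
    """Wait for `enable` using the timeout taken from configuration."""
    wait_until_true(enable, timeout=conf["dhcp_enable_timeout"], sleep=sleep)
```

An operator with many subnets could then pass a configuration such as {"dhcp_enable_timeout": 300} instead of patching the source.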

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1849676

Title:
  DHCP agents time out during startup at 60s when there is enough agents

Status in neutron:
  New

