← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1778616] [NEW] DHCP agent block network-vif-plugged cause Instance failed to build

 

Public bug reported:

ENV:
Neutron stable/queens (12.0.1)
CentOS 7 (3.10.0-514.26.2.el7.x86_64)
Ceph v10.2.9 Jewel

How to reproduce:
Concurrently create 256 VMs in a single network which has 2 dhcp agents.

Exception:
nova-compute side:
2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325
2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds


neutron server log:
A failed nova boot, the neutron port only have one log:
2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73

It will never get the `Provisioning complete` notification.


Furthermore, a successfully booted nova instance, for the provisioning_blocks, it also takes 40s for the DHCP:
2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132
2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138


Code:
For [1], seems a sync action with lock will block all port_update rpc. But for our test, there is no dhcp sync during the instance boot.
For [2], this lock is the essential issue, all the port update are coming from same network, some of them will not get the lock for more than 300 seconds.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428


Potential solution:
Add more dhcp agents for the network, for our test settings, add dhcp_agents_per_network = 10, then it works.
But, such config is really not so good for the production envrionment.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1778616

Title:
  DHCP agent block network-vif-plugged cause Instance failed to build

Status in neutron:
  New

Bug description:
  ENV:
  Neutron stable/queens (12.0.1)
  CentOS 7 (3.10.0-514.26.2.el7.x86_64)
  Ceph v10.2.9 Jewel

  How to reproduce:
  Concurrently create 256 VMs in a single network which has 2 dhcp agents.

  Exception:
  nova-compute side:
  2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325
  2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds

  
  neutron server log:
  A failed nova boot, the neutron port only have one log:
  2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73

  It will never get the `Provisioning complete` notification.

  
  Furthermore, a successfully booted nova instance, for the provisioning_blocks, it also takes 40s for the DHCP:
  2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
  2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132
  2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138


  Code:
  For [1], seems a sync action with lock will block all port_update rpc. But for our test, there is no dhcp sync during the instance boot.
  For [2], this lock is the essential issue, all the port update are coming from same network, some of them will not get the lock for more than 300 seconds.

  [1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424
  [2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428

  
  Potential solution:
  Add more dhcp agents for the network, for our test settings, add dhcp_agents_per_network = 10, then it works.
  But, such config is really not so good for the production envrionment.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1778616/+subscriptions