yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #73496
[Bug 1778616] [NEW] DHCP agent block network-vif-plugged cause Instance failed to build
Public bug reported:
ENV:
Neutron stable/queens (12.0.1)
CentOS 7 (3.10.0-514.26.2.el7.x86_64)
Ceph v10.2.9 Jewel
How to reproduce:
Concurrently create 256 VMs in a single network which has 2 dhcp agents.
Exception:
nova-compute side:
2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325
2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds
neutron server log:
A failed nova boot, the neutron port only have one log:
2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
It will never get the `Provisioning complete` notification.
Furthermore, a successfully booted nova instance, for the provisioning_blocks, it also takes 40s for the DHCP:
2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132
2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138
Code:
For [1], seems a sync action with lock will block all port_update rpc. But for our test, there is no dhcp sync during the instance boot.
For [2], this lock is the essential issue, all the port update are coming from same network, some of them will not get the lock for more than 300 seconds.
[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428
Potential solution:
Add more dhcp agents for the network, for our test settings, add dhcp_agents_per_network = 10, then it works.
But, such config is really not so good for the production envrionment.
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1778616
Title:
DHCP agent block network-vif-plugged cause Instance failed to build
Status in neutron:
New
Bug description:
ENV:
Neutron stable/queens (12.0.1)
CentOS 7 (3.10.0-514.26.2.el7.x86_64)
Ceph v10.2.9 Jewel
How to reproduce:
Concurrently create 256 VMs in a single network which has 2 dhcp agents.
Exception:
nova-compute side:
2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325
2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds
neutron server log:
A failed nova boot, the neutron port only have one log:
2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
It will never get the `Provisioning complete` notification.
Furthermore, a successfully booted nova instance, for the provisioning_blocks, it also takes 40s for the DHCP:
2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132
2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138
Code:
For [1], seems a sync action with lock will block all port_update rpc. But for our test, there is no dhcp sync during the instance boot.
For [2], this lock is the essential issue, all the port update are coming from same network, some of them will not get the lock for more than 300 seconds.
[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428
Potential solution:
Add more dhcp agents for the network, for our test settings, add dhcp_agents_per_network = 10, then it works.
But, such config is really not so good for the production envrionment.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1778616/+subscriptions