Message #92390
[Bug 2020215] [NEW] ml2/ovn refuses to bind port due to dead agent randomly in the nova-live-migrate ci job
Public bug reported:
We have seen random failures of
test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume]
in the nova-live-migration job with the following error:
Details: {'code': 400, 'message': 'Migration pre-check error: Binding
failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check
neutron logs for more information.'}
Looking at the neutron log we see:
May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING
neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-
ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4
service neutron] Refusing to bind port
e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent:
<neutron.plugins.ml2.drivers.ovn.agent.neutron_agent.ControllerAgent
object at 0x7f6a7a6d2950>
May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR
neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152
req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind
port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for
vnic_type normal using segments [{'id':
'1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve',
'physical_network': None, 'segmentation_id': 525, 'network_id':
'745f0724-2779-4d60-845c-8f673d567d0d'}]
and the following in the neutron-ovn-metadata-agent log on the host
where the VM is migrating to:
May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]:
DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis
table for 10 seconds {{(pid=38857) run
/opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}
This looks like it might be related to
https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e
which modified the code to add some randomness because of
https://bugs.launchpad.net/neutron/+bug/1991817
but that seems to negatively impact the stability of the agent.
To fix this I will propose a patch to change the interval from
interval = randint(0, cfg.CONF.agent_down_time // 2)
to
interval = randint(0, cfg.CONF.agent_down_time // 3)
to increase the likelihood that we send the heartbeat in time.
When we are making calls to privsep and OVS, the logs stop for multiple
seconds while those operations are happening, and if that happens at the
wrong time I believe this leads to us missing the heartbeat interval.
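
As a rough illustration of why a smaller upper bound helps, the sketch
below compares the worst-case random delay for the two divisors and the
time left before agent_down_time elapses. The helper names are
hypothetical, the value 75 for agent_down_time is an assumption (check
your own configuration), and "slack" is a simplification that treats
the agent as dead once agent_down_time passes without an update.

```python
from random import randint


def max_delay(agent_down_time, divisor):
    """Largest value randint(0, agent_down_time // divisor) can return."""
    return agent_down_time // divisor


def worst_case_slack(agent_down_time, divisor):
    """Seconds left before agent_down_time elapses in the worst case.

    Simplified model (assumption): the only delay is the random
    interval itself, so slack = agent_down_time - max_delay.
    """
    return agent_down_time - max_delay(agent_down_time, divisor)


def pick_interval(agent_down_time, divisor):
    """Mirror of the expression quoted in the report, with the divisor
    made a parameter for comparison purposes."""
    return randint(0, agent_down_time // divisor)


if __name__ == "__main__":
    down_time = 75  # assumed value, not taken from the failing job
    for divisor in (2, 3):
        print(
            f"// {divisor}: max delay {max_delay(down_time, divisor)}s, "
            f"worst-case slack {worst_case_slack(down_time, divisor)}s"
        )
```

Under these assumptions, a multi-second stall in privsep or OVS calls
eats into a larger margin with // 3 than with // 2, which is the
intent of the proposed change.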
** Affects: neutron
Importance: Undecided
Assignee: sean mooney (sean-k-mooney)
Status: New
** Changed in: neutron
Assignee: (unassigned) => sean mooney (sean-k-mooney)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2020215