yahoo-eng-team team mailing list archive

[Bug 2020215] [NEW] ml2/ovn refuses to bind port due to dead agent randomly in the nova-live-migrate ci job

 

Public bug reported:

we have seen random failures of

test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume]

in the nova-live-migration job with the following error:

Details: {'code': 400, 'message': 'Migration pre-check error: Binding
failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check
neutron logs for more information.'}


Looking at the neutron log we see:

May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING
neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-
ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4
service neutron] Refusing to bind port
e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent:
<neutron.plugins.ml2.drivers.ovn.agent.neutron_agent.ControllerAgent
object at 0x7f6a7a6d2950>

May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR
neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152
req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind
port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for
vnic_type normal using segments [{'id':
'1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve',
'physical_network': None, 'segmentation_id': 525, 'network_id':
'745f0724-2779-4d60-845c-8f673d567d0d'}]


and the following in the neutron-ovn-metadata-agent log on the host the VM is migrating to:

May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]:
DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis
table for 10 seconds {{(pid=38857) run
/opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}

This looks like it might be related to

https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e

This modified the code to add some randomness due to
https://bugs.launchpad.net/neutron/+bug/1991817

but that seems to negatively impact the stability of the agent.

To fix this I will propose a patch to change the interval from

interval = randint(0, cfg.CONF.agent_down_time // 2)

to

interval = randint(0, cfg.CONF.agent_down_time // 3)

to increase the likelihood that we send the heartbeat in time.

When we are making calls to privsep and ovs, the logs stop for multiple
seconds while those operations are happening, and if that happens at the
wrong time I believe this leads to us missing the heartbeat interval.
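
To make the reasoning above concrete, here is a rough sketch of how the
divisor changes the worst-case delay and the time left over for stalls.
It is a toy model, not the neutron code: agent_down_time = 75 is an
assumed value, the helper names are hypothetical, and it assumes a
heartbeat is "late" when the random delay plus a single stall exceeds
agent_down_time, which simplifies how liveness is really evaluated.

```python
from random import Random

# Assumed value for illustration only; check your own neutron.conf.
AGENT_DOWN_TIME = 75  # seconds


def max_delay(agent_down_time: int, divisor: int) -> int:
    """Upper bound passed to randint(0, agent_down_time // divisor)."""
    return agent_down_time // divisor


def stall_budget(agent_down_time: int, divisor: int) -> int:
    """Seconds left for stalls when the worst-case delay is drawn."""
    return agent_down_time - max_delay(agent_down_time, divisor)


def late_fraction(agent_down_time: int, divisor: int, stall: int,
                  trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated heartbeats where delay + stall overruns.

    Hypothetical model: one random delay plus one fixed stall (for
    example a slow privsep or ovs call) per heartbeat.
    """
    rng = Random(seed)
    upper = max_delay(agent_down_time, divisor)
    late = sum(
        1 for _ in range(trials)
        if rng.randint(0, upper) + stall > agent_down_time
    )
    return late / trials


if __name__ == "__main__":
    for divisor in (2, 3):
        print(
            f"// {divisor}: max delay {max_delay(AGENT_DOWN_TIME, divisor)}s, "
            f"stall budget {stall_budget(AGENT_DOWN_TIME, divisor)}s, "
            f"late with 45s stall: "
            f"{late_fraction(AGENT_DOWN_TIME, divisor, stall=45):.2%}"
        )
```

Under these assumptions the // 3 bound leaves more room for a stall
before the deadline is missed, which is the intent of the proposed
change. The real margin depends on the configured agent_down_time and
on how long the privsep and ovs calls actually block.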

** Affects: neutron
     Importance: Undecided
     Assignee: sean mooney (sean-k-mooney)
         Status: New

** Changed in: neutron
     Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2020215

Title:
  ml2/ovn refuses to bind port due to dead agent randomly in the nova-
  live-migrate ci job

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2020215/+subscriptions