yahoo-eng-team team mailing list archive

[Bug 1956958] [NEW] Functional tests for HA routers fail due to router transitioning to FAULT state

 

Public bug reported:

Example of the failure:
https://71d2302875cffcacbcb7-bd54a9781d6bc663ca8af93b25749dfd.ssl.cf5.rackcdn.com/823300/1/gate/neutron-functional-with-uwsgi/1938908/testr_results.html

Stacktrace:

ft1.53: neutron.tests.functional.agent.l3.extensions.qos.test_fip_qos_extension.TestL3AgentFipQosExtensionDVR.test_dvr_ha_router_failover_without_gw
testtools.testresult.real._StringException: Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 718, in wait_until_true
    eventlet.sleep(sleep)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/greenthread.py", line 36, in sleep
    hub.switch()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 60 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
    return f(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
    return f(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1694, in test_dvr_ha_router_failover_without_gw
    self._test_dvr_ha_router_failover(enable_gw=False, vrrp_id=12)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1680, in _test_dvr_ha_router_failover
    utils.wait_until_true(lambda: primary.ha_state == 'backup')
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 723, in wait_until_true
    raise WaitTimeout(_("Timed out after %d seconds") % timeout)
neutron.common.utils.WaitTimeout: Timed out after 60 seconds
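
The second traceback shows the test giving up inside wait_until_true because the predicate primary.ha_state == 'backup' never became true within 60 seconds. A minimal sketch of that polling pattern follows. It assumes a plain time-based loop instead of the eventlet-based code in neutron/common/utils.py; the names wait_until_true and WaitTimeout mirror the traceback, but the body is illustrative and not the Neutron implementation.

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Simplified stand-in: the helper seen in the traceback relies on
    eventlet, this sketch only uses the standard library.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)
```

With this shape, a router stuck in any state other than 'backup' produces exactly the "Timed out after 60 seconds" error above, without saying which state it was actually in.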


From the logs of the failed test, I see that only the router on one of the "agents" transitioned properly, first to backup and then to primary:

2022-01-04 11:04:57.973 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to backup on agent agent2
2022-01-04 11:05:07.184 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to primary on agent agent2


but the router on the second agent did not:

2022-01-04 11:04:59.956 73811 DEBUG neutron.agent.l3.ha [-] Current transition state of router 6652fbd8-2612-48a4-92fb-1b972c20b012: backup; Initial state was: primary _enqueue_state_change /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/l3/ha.py:158
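
To compare the two routers quickly, the "transitioned to" INFO lines can be grouped per router. The sketch below assumes the message format shown in the excerpts above; transitions_by_router is a hypothetical helper, not part of Neutron.

```python
import re
from collections import defaultdict

# Matches the INFO message format seen in the log excerpt above.
TRANSITION_RE = re.compile(
    r"Router (?P<router>[0-9a-f-]{36}) transitioned to "
    r"(?P<state>\w+) on agent (?P<agent>\S+)"
)


def transitions_by_router(lines):
    """Return {router_id: [(state, agent), ...]} in log order."""
    result = defaultdict(list)
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            result[match.group("router")].append(
                (match.group("state"), match.group("agent"))
            )
    return dict(result)
```

A router that only shows up in the DEBUG "Current transition state" line, like 6652fbd8-2612-48a4-92fb-1b972c20b012 here, is absent from the result, which makes the missing transition easy to spot.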


In the journal log I see something like:

sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Netlink reports ha-597350ae-19 down
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) sent 0 priority
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) removing VIPs.
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Deassigned address fe80::1034:56ff:fe78:2bcc from interface ha-597350ae-19
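
A small sketch for pulling such events out of a journal dump, pairing each FAULT transition with the interface that Netlink last reported down. The regular expressions assume the message wording shown in the lines above (other keepalived versions may log differently), and find_fault_events is a hypothetical helper.

```python
import re

# Patterns based only on the journal excerpt above.
NETLINK_DOWN_RE = re.compile(r"Netlink reports (?P<iface>\S+) down")
STATE_RE = re.compile(r"\((?P<vr>VR_\d+)\) Entering (?P<state>[A-Z]+) STATE")


def find_fault_events(lines):
    """Return [(vrrp_instance, last_interface_reported_down), ...]."""
    events = []
    last_down_iface = None
    for line in lines:
        if "Keepalived_vrrp" not in line:
            continue
        down = NETLINK_DOWN_RE.search(line)
        if down:
            last_down_iface = down.group("iface")
            continue
        state = STATE_RE.search(line)
        if state and state.group("state") == "FAULT":
            events.append((state.group("vr"), last_down_iface))
    return events
```

Run against the five journal lines above, this reports VR_12 entering FAULT right after ha-597350ae-19 went down.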

I'm not sure whether this is the actual reason the test failed, but we
will probably need to add more logging to the L3 HA functional tests
and investigate further when similar failures happen again.
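
One possible shape for that extra logging, as a rough sketch only: a wait helper that records every distinct state it observes and reports the sequence on timeout, so a failure shows whether the router stayed primary or went to fault. wait_for_ha_state and its parameters are hypothetical and do not exist in the Neutron tree.

```python
import time


def wait_for_ha_state(get_state, expected, timeout=60, sleep=1):
    """Poll get_state() until it returns `expected`.

    Returns the list of distinct consecutive states observed. On
    timeout, raises AssertionError listing those states.
    """
    seen = []
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        # Record only changes, so the history stays short.
        if not seen or seen[-1] != state:
            seen.append(state)
        if state == expected:
            return seen
        if time.monotonic() >= deadline:
            raise AssertionError(
                "expected ha_state %r within %ss, observed: %s"
                % (expected, timeout, " -> ".join(map(str, seen)))
            )
        time.sleep(sleep)
```

In the failing test this would be called roughly as wait_for_ha_state(lambda: primary.ha_state, 'backup'), in place of the bare wait_until_true call from the traceback.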

** Affects: neutron
     Importance: High
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed


** Tags: functional-tests gate-failure l3-ha

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1956958

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1956958/+subscriptions