yahoo-eng-team team mailing list archive
Message #88018
[Bug 1956958] [NEW] Functional tests for HA routers fails due to router transitioned to FAULT state
Public bug reported:
Example of the failure:
https://71d2302875cffcacbcb7-bd54a9781d6bc663ca8af93b25749dfd.ssl.cf5.rackcdn.com/823300/1/gate/neutron-functional-with-uwsgi/1938908/testr_results.html
Stacktrace:
ft1.53: neutron.tests.functional.agent.l3.extensions.qos.test_fip_qos_extension.TestL3AgentFipQosExtensionDVR.test_dvr_ha_router_failover_without_gw
testtools.testresult.real._StringException: Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 718, in wait_until_true
    eventlet.sleep(sleep)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/greenthread.py", line 36, in sleep
    hub.switch()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 60 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
    return f(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
    return f(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1694, in test_dvr_ha_router_failover_without_gw
    self._test_dvr_ha_router_failover(enable_gw=False, vrrp_id=12)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1680, in _test_dvr_ha_router_failover
    utils.wait_until_true(lambda: primary.ha_state == 'backup')
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 723, in wait_until_true
    raise WaitTimeout(_("Timed out after %d seconds") % timeout)
neutron.common.utils.WaitTimeout: Timed out after 60 seconds
From the logs of the failed test, I see that only the router on one of the "agents" properly transitioned first to backup and then to primary:
2022-01-04 11:04:57.973 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to backup on agent agent2
2022-01-04 11:05:07.184 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to primary on agent agent2
but the router on the second agent did not:
2022-01-04 11:04:59.956 73811 DEBUG neutron.agent.l3.ha [-] Current transition state of router 6652fbd8-2612-48a4-92fb-1b972c20b012: backup; Initial state was: primary _enqueue_state_change /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/l3/ha.py:158
In the journal log I see something like:
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Netlink reports ha-597350ae-19 down
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) sent 0 priority
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) removing VIPs.
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Deassigned address fe80::1034:56ff:fe78:2bcc from interface ha-597350ae-19
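To spot such transitions quickly in a long journal, something like the following could be used. This is only a sketch: the regular expression is derived from the keepalived lines quoted above, and `find_vrrp_transitions` is a hypothetical helper name, not part of Neutron or keepalived. Other keepalived versions may word these messages differently.

```python
import re

# Matches lines such as
# "Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE".
# Assumption: the wording follows the journal excerpt quoted in this report.
STATE_RE = re.compile(
    r"Keepalived_vrrp\[(?P<pid>\d+)\]: \((?P<instance>[^)]+)\) "
    r"Entering (?P<state>\w+) STATE"
)


def find_vrrp_transitions(lines):
    """Return (pid, instance, state) for each VRRP state change found."""
    transitions = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            transitions.append(
                (int(match.group("pid")),
                 match.group("instance"),
                 match.group("state"))
            )
    return transitions


# Sample input copied from the journal excerpt in this report.
journal = [
    "sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 "
    "Keepalived_vrrp[113555]: Netlink reports ha-597350ae-19 down",
    "sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 "
    "Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE",
    "sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 "
    "Keepalived_vrrp[113555]: (VR_12) sent 0 priority",
]

print(find_vrrp_transitions(journal))  # [(113555, 'VR_12', 'FAULT')]
```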
I'm not sure whether the FAULT state transition is really the main reason
the test failed, but we will probably need to add more logging to the
L3 HA functional tests and investigate further when similar failures
happen again.
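As one possible direction for the extra logging, the wait in the test could record which states it actually observed, so that a timeout reports more than "Timed out after 60 seconds". The sketch below is an assumption, not Neutron's `utils.wait_until_true`: `wait_until_true_verbose`, its `describe` parameter, and `FakeRouter` are hypothetical names, and it uses `time.sleep` instead of eventlet to stay self-contained.

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def wait_until_true_verbose(predicate, describe, timeout=60, sleep=1):
    """Poll predicate(); on timeout, report each distinct describe() value.

    Hypothetical helper for illustration only; not Neutron's
    utils.wait_until_true.
    """
    seen = []
    deadline = time.monotonic() + timeout
    while True:
        current = describe()
        # Record the value only when it changes, to keep the history short.
        if not seen or seen[-1] != current:
            seen.append(current)
        if predicate():
            return seen
        if time.monotonic() >= deadline:
            raise WaitTimeout(
                "Timed out after %s seconds; observed states: %s"
                % (timeout, " -> ".join(str(s) for s in seen))
            )
        time.sleep(sleep)


class FakeRouter:
    """Stand-in for the router object used in the functional test."""
    ha_state = "primary"


router = FakeRouter()
try:
    wait_until_true_verbose(
        lambda: router.ha_state == "backup",
        lambda: router.ha_state,
        timeout=0.2,
        sleep=0.05,
    )
except WaitTimeout as exc:
    print(exc)
```

With a message like this, a router that stays in "primary" for the whole wait could be told apart from one that briefly changed state and then changed back.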
** Affects: neutron
Importance: High
Assignee: Slawek Kaplonski (slaweq)
Status: Confirmed
** Tags: functional-tests gate-failure l3-ha
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1956958