
yahoo-eng-team team mailing list archive

[Bug 1818614] Re: Various L3HA functional tests fails often

 

** Description changed:

+ [Impact]
+ This fix needs to be added to the Ubuntu packages to safeguard against missed VRRP transitions, which happen when ip -o monitor is not running at the time the transition occurs. We have seen many cases in the field where neutron routers end up as active on multiple L3 agents (as reported via the neutron API), which leads to a number of problems.
+ 
+ [Test Case]
+ * deploy OpenStack (any version that supports L3 HA)
+ * create an HA router with max-l3-agents=2
+ * check neutron l3-agent-list-hosting-router for the master location
+ * on both hosts that are running the l3-agent, run:
+ 
+ pid=`pgrep -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"`
+ ps -f --ppid $pid
+ pkill -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"
+ ps -f --ppid $pid   # this should return nothing now
+ pkill -f "/var/lib/neutron/ha_confs/$ROUTER_UUID/keepalived.conf"
+ 
+ * without this patch you should now see both agents reporting the router as "active"
+ * with the patch this should not happen (once neutron-keepalived-state-change has been restarted)
+ 
+ [Regression Potential]
+ 
+ ====================================================================
+ 
  Recently, many L3 HA related functional tests have been failing.
  The common factor in all these errors is that the test fails while waiting for the L3 HA router to become master.
  
  Example stack trace:
  
  ft2.12: neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ha_router_lifecycle_StringException: Traceback (most recent call last):
-   File "neutron/tests/base.py", line 174, in func
-     return f(self, *args, **kwargs)
-   File "neutron/tests/base.py", line 174, in func
-     return f(self, *args, **kwargs)
-   File "neutron/tests/functional/agent/l3/test_ha_router.py", line 81, in test_ha_router_lifecycle
-     self._router_lifecycle(enable_ha=True, router_info=router_info)
-   File "neutron/tests/functional/agent/l3/framework.py", line 274, in _router_lifecycle
-     common_utils.wait_until_true(lambda: router.ha_state == 'master')
-   File "neutron/common/utils.py", line 690, in wait_until_true
-     raise WaitTimeout(_("Timed out after %d seconds") % timeout)
+   File "neutron/tests/base.py", line 174, in func
+     return f(self, *args, **kwargs)
+   File "neutron/tests/base.py", line 174, in func
+     return f(self, *args, **kwargs)
+   File "neutron/tests/functional/agent/l3/test_ha_router.py", line 81, in test_ha_router_lifecycle
+     self._router_lifecycle(enable_ha=True, router_info=router_info)
+   File "neutron/tests/functional/agent/l3/framework.py", line 274, in _router_lifecycle
+     common_utils.wait_until_true(lambda: router.ha_state == 'master')
+   File "neutron/common/utils.py", line 690, in wait_until_true
+     raise WaitTimeout(_("Timed out after %d seconds") % timeout)
  neutron.common.utils.WaitTimeout: Timed out after 60 seconds
  
  Example failure: http://logs.openstack.org/79/633979/21/check/neutron-functional-python27/ce7ef07/logs/testr_results.html.gz
  
  Logstash query:
  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ha_state%20%3D%3D%20'master')%5C%22

** Description changed:

  [Impact]
  This fix needs to be added to the Ubuntu packages to safeguard against missed VRRP transitions, which happen when ip -o monitor is not running at the time the transition occurs. We have seen many cases in the field where neutron routers end up as active on multiple L3 agents (as reported via the neutron API), which leads to a number of problems.
  
  [Test Case]
  * deploy OpenStack (any version that supports L3 HA)
  * create an HA router with max-l3-agents=2
  * check neutron l3-agent-list-hosting-router for the master location
  * on both hosts that are running the l3-agent, run:
  
  pid=`pgrep -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"`
  ps -f --ppid $pid
  pkill -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"
  ps -f --ppid $pid   # this should return nothing now
  pkill -f "/var/lib/neutron/ha_confs/$ROUTER_UUID/keepalived.conf"
  
  * without this patch you should now see both agents reporting the router as "active"
  * with the patch this should not happen (once neutron-keepalived-state-change has been restarted)
  
  [Regression Potential]
+ None expected.
  
  ====================================================================
  
  Recently, many L3 HA related functional tests have been failing.
  The common factor in all these errors is that the test fails while waiting for the L3 HA router to become master.
  
  Example stack trace:
  
  ft2.12: neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ha_router_lifecycle_StringException: Traceback (most recent call last):
    File "neutron/tests/base.py", line 174, in func
      return f(self, *args, **kwargs)
    File "neutron/tests/base.py", line 174, in func
      return f(self, *args, **kwargs)
    File "neutron/tests/functional/agent/l3/test_ha_router.py", line 81, in test_ha_router_lifecycle
      self._router_lifecycle(enable_ha=True, router_info=router_info)
    File "neutron/tests/functional/agent/l3/framework.py", line 274, in _router_lifecycle
      common_utils.wait_until_true(lambda: router.ha_state == 'master')
    File "neutron/common/utils.py", line 690, in wait_until_true
      raise WaitTimeout(_("Timed out after %d seconds") % timeout)
  neutron.common.utils.WaitTimeout: Timed out after 60 seconds
  
  Example failure: http://logs.openstack.org/79/633979/21/check/neutron-functional-python27/ce7ef07/logs/testr_results.html.gz
  
  Logstash query:
  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ha_state%20%3D%3D%20'master')%5C%22

** Summary changed:

- Various L3HA functional tests fails often
+ [SRU] Various L3HA functional tests fails often

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/pike
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1818614

Title:
  [SRU] Various L3HA functional tests fails often

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive pike series:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in neutron:
  Fix Released

Bug description:
  [Impact]
  This fix needs to be added to the Ubuntu packages to safeguard against missed VRRP transitions, which happen when ip -o monitor is not running at the time the transition occurs. We have seen many cases in the field where neutron routers end up as active on multiple L3 agents (as reported via the neutron API), which leads to a number of problems.

  [Test Case]
  * deploy OpenStack (any version that supports L3 HA)
  * create an HA router with max-l3-agents=2
  * check neutron l3-agent-list-hosting-router for the master location
  * on both hosts that are running the l3-agent, run:

  pid=`pgrep -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"`
  ps -f --ppid $pid
  pkill -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"
  ps -f --ppid $pid   # this should return nothing now
  pkill -f "/var/lib/neutron/ha_confs/$ROUTER_UUID/keepalived.conf"

  * without this patch you should now see both agents reporting the router as "active"
  * with the patch this should not happen (once neutron-keepalived-state-change has been restarted)
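
  The final check in the steps above can be scripted rather than read by eye. The following is only a minimal sketch: it assumes that neutron l3-agent-list-hosting-router prints an ASCII table containing "host" and "ha_state" columns, and the function names and the sample table are hypothetical, not part of neutron.

```python
def parse_cli_table(text):
    """Parse a '+---+' / '| a | b |' style table into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and blank lines
        # Drop the empty strings produced by the leading and trailing pipes.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells  # the first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def active_hosts(text):
    """Return the hosts whose ha_state column reads 'active'."""
    return [row["host"] for row in parse_cli_table(text)
            if row.get("ha_state") == "active"]


# Hypothetical sample output, showing the faulty "both active" situation.
SAMPLE = """
+------+--------+----------------+-------+----------+
| id   | host   | admin_state_up | alive | ha_state |
+------+--------+----------------+-------+----------+
| a1   | node-1 | True           | :-)   | active   |
| b2   | node-2 | True           | :-)   | active   |
+------+--------+----------------+-------+----------+
"""

if __name__ == "__main__":
    hosts = active_hosts(SAMPLE)
    if len(hosts) > 1:
        print("router reported active on multiple agents:", ", ".join(hosts))
```

  If more than one host comes back, the router is in the multiple-active state described under [Impact].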

  [Regression Potential]
  None expected.

  ====================================================================

  Recently, many L3 HA related functional tests have been failing.
  The common factor in all these errors is that the test fails while waiting for the L3 HA router to become master.

  Example stack trace:

  ft2.12: neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ha_router_lifecycle_StringException: Traceback (most recent call last):
    File "neutron/tests/base.py", line 174, in func
      return f(self, *args, **kwargs)
    File "neutron/tests/base.py", line 174, in func
      return f(self, *args, **kwargs)
    File "neutron/tests/functional/agent/l3/test_ha_router.py", line 81, in test_ha_router_lifecycle
      self._router_lifecycle(enable_ha=True, router_info=router_info)
    File "neutron/tests/functional/agent/l3/framework.py", line 274, in _router_lifecycle
      common_utils.wait_until_true(lambda: router.ha_state == 'master')
    File "neutron/common/utils.py", line 690, in wait_until_true
      raise WaitTimeout(_("Timed out after %d seconds") % timeout)
  neutron.common.utils.WaitTimeout: Timed out after 60 seconds
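
  For readers unfamiliar with the helper at the bottom of this traceback, the pattern is a poll-until-true loop with a deadline. The sketch below is not neutron's implementation of wait_until_true; it is a simplified standalone version whose names mirror the traceback. The 60 second default matches the message above, while the sleep interval is an assumption.

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll predicate() until it returns a true value or timeout expires.

    Simplified illustration only; the sleep default is an assumption.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)


class FakeRouter:
    """Hypothetical stand-in for a router object exposing ha_state."""

    def __init__(self):
        self.ha_state = "backup"


if __name__ == "__main__":
    router = FakeRouter()
    try:
        # The state never changes here, so the wait times out, as in the bug.
        wait_until_true(lambda: router.ha_state == "master",
                        timeout=1, sleep=0.1)
    except WaitTimeout as exc:
        print(exc)
```

  In the failing tests, the predicate router.ha_state == 'master' never becomes true within the timeout, so the helper raises WaitTimeout.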

  Example failure: http://logs.openstack.org/79/633979/21/check/neutron-functional-python27/ce7ef07/logs/testr_results.html.gz

  Logstash query:
  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ha_state%20%3D%3D%20'master')%5C%22

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1818614/+subscriptions

