← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 2003359] [NEW] DVR HA router gets stuck in backup state

 

Public bug reported:

We found the issue when a created HA DVR router gets stuck in the backup state and does not go into primary state.
Preconditions:
1) there is no router with a specific external network yet
2) the router needs to go through a quick creation->deletion, and then the next creation of the router can get stuck in the backup state

The reason for such behavior is not removed fip-ns on the agent while the floatingip_agent_gateway port was removed.
Further is a demo with the help of which I managed to reproduce this behavior on a single node devstack setup with.

Сreate a router and quickly delete it while the l3 agent processes the
external GW adding:

[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_requests() { journalctl -u devstack@q-l3.service | grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
[root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack ~]#

As a result fip-ns is not deleted even though the
floatingip_agent_gateway port was removed:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
<empty>
[root@devstack ~]#

Re-create the router together with external gw now:

[root@devstack ~]# openstack router create r1 --ha --distributed
--external-gateway public

In the logs, one can see a traceback that the creation of this router
failed initially, followed by a successful creation:

ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns     self._update_gateway_port(
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns     self.driver.set_onlink_routes(
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns     onlink = device.route.list_onlink_routes(constants.IP_VERSION_4)
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns     routes = self.list_routes(ip_version, scope='link')
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
ERROR neutron.agent.l3.dvr_fip_ns     return list_ip_routes(self._parent.namespace, ip_version, scope=scope,
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
ERROR neutron.agent.l3.dvr_fip_ns     routes = privileged.list_ip_routes(namespace, ip_version, device=device,
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
ERROR neutron.agent.l3.dvr_fip_ns     return self(f, *args, **kw)
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
ERROR neutron.agent.l3.dvr_fip_ns     do = self.iter(retry_state=retry_state)
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
ERROR neutron.agent.l3.dvr_fip_ns     return fut.result()
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
ERROR neutron.agent.l3.dvr_fip_ns     return self.__get_result()
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
ERROR neutron.agent.l3.dvr_fip_ns     raise self._exception
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
ERROR neutron.agent.l3.dvr_fip_ns     result = fn(*args, **kwargs)
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
ERROR neutron.agent.l3.dvr_fip_ns     return self.channel.remote_call(name, args, kwargs,
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
ERROR neutron.agent.l3.dvr_fip_ns     raise exc_type(*result[2])
ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f.


The result is the following state:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2)
qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b
snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
+--------------------------------------+----------------------------------+--------+
| ID                                   | Device Owner                     | Status |
+--------------------------------------+----------------------------------+--------+
| 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway           | DOWN   |
| b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE |
+--------------------------------------+----------------------------------+--------+
[root@devstack ~]#
[root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state
backup
[root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log
...
Access: 2023-01-19 11:10:10.715245690 -0500
Modify: 2023-01-19 11:10:18.976208238 -0500
Change: 2023-01-19 11:10:18.976208238 -0500
 Birth: 2023-01-19 11:10:10.715245690 -0500
[root@devstack ~]# stat /var/run/netns/fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f
...
Access: 2023-01-19 11:10:19.533205713 -0500
Modify: 2023-01-19 11:10:19.533205713 -0500
Change: 2023-01-19 11:10:19.533205713 -0500
 Birth: -
[root@devstack ~]#

By timestamp we can see that a keepalived monitoring started to work before the fip-ns was re-created after unsuccessful first attempt to create a router.
So keepalived monitoring is still bound to the FIP-ns that was created on the previously stuck namespace.

Adding an external gw and removing a router has a race condition and
it's not always possible to get 100% reproduction. To achieve 100%
reproduction, just add a small sleep with the following patch:

[root@devstack neutron]# git diff
diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
index 6e37c09511..d01eb0de9b 100644
--- a/neutron/agent/l3/dvr_local_router.py
+++ b/neutron/agent/l3/dvr_local_router.py
@@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                 self.agent.context, ex_gw_port['network_id'])
             LOG.debug("FloatingIP agent gateway port received from the "
                       "plugin: %s", fip_agent_port)
+            import time
+            time.sleep(5)
         self.fip_ns.create_or_update_gateway_port(fip_agent_port)

     def update_routing_table(self, operation, route):
[root@devstack neutron]#

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2003359

Title:
  DVR HA router gets stuck in backup state

Status in neutron:
  New

Bug description:
  We found the issue when a created HA DVR router gets stuck in the backup state and does not go into primary state.
  Preconditions:
  1) there is no router with a specific external network yet
  2) the router needs to go through a quick creation->deletion, and then the next creation of the router can get stuck in the backup state

  The reason for such behavior is not removed fip-ns on the agent while the floatingip_agent_gateway port was removed.
  Further is a demo with the help of which I managed to reproduce this behavior on a single node devstack setup with.

  Сreate a router and quickly delete it while the l3 agent processes the
  external GW adding:

  [root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
  [root@devstack ~]# count_fip_requests() { journalctl -u devstack@q-l3.service | grep 'FloatingIP agent gateway port received' | wc -l; }
  [root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
  [root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
  waiting before deletion...
  waiting before deletion...
  [root@devstack ~]#

  As a result fip-ns is not deleted even though the
  floatingip_agent_gateway port was removed:

  [root@devstack ~]# ip netns
  fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
  [root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
  <empty>
  [root@devstack ~]#

  Re-create the router together with external gw now:

  [root@devstack ~]# openstack router create r1 --ha --distributed
  --external-gateway public

  In the logs, one can see a traceback that the creation of this router
  failed initially, followed by a successful creation:

  ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
  ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port
  ERROR neutron.agent.l3.dvr_fip_ns     self._update_gateway_port(
  ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port
  ERROR neutron.agent.l3.dvr_fip_ns     self.driver.set_onlink_routes(
  ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes
  ERROR neutron.agent.l3.dvr_fip_ns     onlink = device.route.list_onlink_routes(constants.IP_VERSION_4)
  ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes
  ERROR neutron.agent.l3.dvr_fip_ns     routes = self.list_routes(ip_version, scope='link')
  ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
  ERROR neutron.agent.l3.dvr_fip_ns     return list_ip_routes(self._parent.namespace, ip_version, scope=scope,
  ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
  ERROR neutron.agent.l3.dvr_fip_ns     routes = privileged.list_ip_routes(namespace, ip_version, device=device,
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
  ERROR neutron.agent.l3.dvr_fip_ns     return self(f, *args, **kw)
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
  ERROR neutron.agent.l3.dvr_fip_ns     do = self.iter(retry_state=retry_state)
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
  ERROR neutron.agent.l3.dvr_fip_ns     return fut.result()
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
  ERROR neutron.agent.l3.dvr_fip_ns     return self.__get_result()
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
  ERROR neutron.agent.l3.dvr_fip_ns     raise self._exception
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
  ERROR neutron.agent.l3.dvr_fip_ns     result = fn(*args, **kwargs)
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
  ERROR neutron.agent.l3.dvr_fip_ns     return self.channel.remote_call(name, args, kwargs,
  ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
  ERROR neutron.agent.l3.dvr_fip_ns     raise exc_type(*result[2])
  ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f.

  
  The result is the following state:

  [root@devstack ~]# ip netns
  fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2)
  qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b
  snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1)
  [root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
  +--------------------------------------+----------------------------------+--------+
  | ID                                   | Device Owner                     | Status |
  +--------------------------------------+----------------------------------+--------+
  | 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway           | DOWN   |
  | b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE |
  +--------------------------------------+----------------------------------+--------+
  [root@devstack ~]#
  [root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state
  backup
  [root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log
  ...
  Access: 2023-01-19 11:10:10.715245690 -0500
  Modify: 2023-01-19 11:10:18.976208238 -0500
  Change: 2023-01-19 11:10:18.976208238 -0500
   Birth: 2023-01-19 11:10:10.715245690 -0500
  [root@devstack ~]# stat /var/run/netns/fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f
  ...
  Access: 2023-01-19 11:10:19.533205713 -0500
  Modify: 2023-01-19 11:10:19.533205713 -0500
  Change: 2023-01-19 11:10:19.533205713 -0500
   Birth: -
  [root@devstack ~]#

  By timestamp we can see that a keepalived monitoring started to work before the fip-ns was re-created after unsuccessful first attempt to create a router.
  So keepalived monitoring is still bound to the FIP-ns that was created on the previously stuck namespace.

  Adding an external gw and removing a router has a race condition and
  it's not always possible to get 100% reproduction. To achieve 100%
  reproduction, just add a small sleep with the following patch:

  [root@devstack neutron]# git diff
  diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
  index 6e37c09511..d01eb0de9b 100644
  --- a/neutron/agent/l3/dvr_local_router.py
  +++ b/neutron/agent/l3/dvr_local_router.py
  @@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                   self.agent.context, ex_gw_port['network_id'])
               LOG.debug("FloatingIP agent gateway port received from the "
                         "plugin: %s", fip_agent_port)
  +            import time
  +            time.sleep(5)
           self.fip_ns.create_or_update_gateway_port(fip_agent_port)

       def update_routing_table(self, operation, route):
  [root@devstack neutron]#

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2003359/+subscriptions