yahoo-eng-team team mailing list archive

[Bug 2077533] [NEW] An error in processing one DVR router can lead to connectivity issues for other routers

Public bug reported:

While investigating a customer issue, I concluded that this code:
https://opendev.org/openstack/neutron/src/commit/0807c94dc9843fff318c21d1f6f7b8838f948f5f/neutron/agent/l3/dvr_fip_ns.py#L155-L160
which deletes the fip-namespace when an exception occurs during router processing, causes connectivity problems for other routers. Deleting the fip-namespace also removes the rfp/fpr veth pairs of every other router connected to it, but those other routers are not reprocessed. As a result, every router except the one that triggered the deletion is left without its rfp/fpr veth pair.
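
To make the failure mode concrete, here is a minimal toy model of it. This is not Neutron code: `ToyFipNamespace` and `process_router` are hypothetical names invented for illustration, and the device naming ("fpr-" plus the first 10 characters of the router ID) is only inferred from the `ip a` output in this report.

```python
class ToyFipNamespace:
    """Toy stand-in for a fip-namespace shared by several routers."""

    def __init__(self):
        self.fpr_devices = set()

    def connect(self, router_id):
        # Assumption: device name is "fpr-" + first 10 chars of the router ID.
        self.fpr_devices.add("fpr-" + router_id[:10])

    def delete(self):
        # Deleting the namespace drops the veth ends of ALL routers.
        self.fpr_devices.clear()


def process_router(fip_ns, router_id, fail=False):
    """Process one router; on failure, delete the namespace and retry."""
    try:
        if fail:
            raise RuntimeError("simulated gateway update failure")
        fip_ns.connect(router_id)
    except RuntimeError:
        fip_ns.delete()
        # Only the router that failed is reprocessed; the others are not.
        fip_ns.connect(router_id)


r1 = "25085e63-45a6-4795-93dc-77cb245664d7"
r2 = "3805cd53-5fed-4fa3-9147-f396761fc9cd"

fip_ns = ToyFipNamespace()
process_router(fip_ns, r1)
process_router(fip_ns, r2)
before = sorted(fip_ns.fpr_devices)

process_router(fip_ns, r1, fail=True)
after = sorted(fip_ns.fpr_devices)

print(before)  # ['fpr-25085e63-4', 'fpr-3805cd53-5']
print(after)   # ['fpr-25085e63-4']
```

In this sketch the device for r2 disappears after the failed update of r1 and never comes back, which is the same pattern as in the reproduction in this report.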

The issue might be difficult to trigger, so I'll demonstrate it with a
small hack:

--- a/neutron/agent/l3/dvr_fip_ns.py
+++ b/neutron/agent/l3/dvr_fip_ns.py
@@ -151,6 +151,11 @@ class FipNamespace(namespaces.Namespace):
                 try:
                     self._update_gateway_port(
                         agent_gateway_port, interface_name)
+                    if getattr(self, 'test_fail', False):
+                        self.test_fail = False
+                        raise Exception('Test Fail')
+                    else:
+                        self.test_fail = True
                 except Exception:
                     # If an exception occurs at this point, then it is
                     # good to clean up the namespace that has been created

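The flag logic of the hack can be looked at in isolation. The sketch below copies only the added `if`/`else` into a hypothetical `Toggle` class (not part of Neutron) to show that, per object, every second call raises. Which agent operations correspond to those calls is not modelled here.

```python
class Toggle:
    """Hypothetical class that mirrors only the flag logic of the hack."""

    def update(self):
        if getattr(self, 'test_fail', False):
            # Flag was set by the previous call: reset it and fail.
            self.test_fail = False
            raise Exception('Test Fail')
        else:
            # Flag not set yet: succeed now and arm the next failure.
            self.test_fail = True


def run(calls):
    """Call update() repeatedly and record 'ok' or 'fail' per call."""
    obj = Toggle()
    results = []
    for _ in range(calls):
        try:
            obj.update()
            results.append('ok')
        except Exception:
            results.append('fail')
    return results


outcomes = run(4)
print(outcomes)  # ['ok', 'fail', 'ok', 'fail']
```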

1) I create two routers with the same external network:

[root@devstack0 ~]# openstack router create r1 --external-gateway public -c id
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | 25085e63-45a6-4795-93dc-77cb245664d7 |
+-------+--------------------------------------+
[root@devstack0 ~]# openstack router create r2 --external-gateway public -c id
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | 3805cd53-5fed-4fa3-9147-f396761fc9cd |
+-------+--------------------------------------+
[root@devstack0 ~]# ip netns exec fip-dad747c6-c234-41e3-ae27-c9602b81fbd2 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
   <cut>
2: fpr-25085e63-4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:e9:5e:65:9c:ad brd ff:ff:ff:ff:ff:ff link-netns qrouter-25085e63-45a6-4795-93dc-77cb245664d7
    inet 169.254.120.3/31 scope global fpr-25085e63-4
       valid_lft forever preferred_lft forever
    inet6 fe80::8ce9:5eff:fe65:9cad/64 scope link
       valid_lft forever preferred_lft forever
3: fpr-3805cd53-5@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 12:e1:bf:02:98:e0 brd ff:ff:ff:ff:ff:ff link-netns qrouter-3805cd53-5fed-4fa3-9147-f396761fc9cd
    inet 169.254.77.247/31 scope global fpr-3805cd53-5
       valid_lft forever preferred_lft forever
    inet6 fe80::10e1:bfff:fe02:98e0/64 scope link
       valid_lft forever preferred_lft forever
68: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:e4:ea:e1 brd ff:ff:ff:ff:ff:ff
    inet 10.20.30.95/24 brd 10.20.30.255 scope global fg-21441edb-3f
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee4:eae1/64 scope link
       valid_lft forever preferred_lft forever
[root@devstack0 ~]#

2) I trigger an update of router r1 that fails (see the hack above). The
failure leads to deletion of the fip-namespace and reprocessing of r1
only. The rfp/fpr veth pair of router r2 is lost and not recreated, so
r2 is broken: fpr-3805cd53-5 is missing from the output below.

[root@devstack0 ~]# openstack router set r1 --name r1-updated
[root@devstack0 ~]# ip netns exec fip-dad747c6-c234-41e3-ae27-c9602b81fbd2 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: fpr-25085e63-4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether fa:68:ef:86:96:a5 brd ff:ff:ff:ff:ff:ff link-netns qrouter-25085e63-45a6-4795-93dc-77cb245664d7
    inet 169.254.120.3/31 scope global fpr-25085e63-4
       valid_lft forever preferred_lft forever
    inet6 fe80::f868:efff:fe86:96a5/64 scope link
       valid_lft forever preferred_lft forever
71: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:e4:ea:e1 brd ff:ff:ff:ff:ff:ff
    inet 10.20.30.95/24 brd 10.20.30.255 scope global fg-21441edb-3f
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee4:eae1/64 scope link
       valid_lft forever preferred_lft forever
[root@devstack0 ~]#
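
A check like the one done by eye above could be scripted. The following is a rough sketch, assuming the fpr device name is "fpr-" plus the first 10 characters of the router ID (inferred from the output in this report, not taken from the Neutron source). The helper names are hypothetical, and the sample text is an abbreviated copy of the post-update output.

```python
import re

# Matches the device name on lines such as "2: fpr-25085e63-4@if3: <...>".
FPR_LINE = re.compile(r"^\s*\d+:\s+(fpr-[^@:\s]+)", re.MULTILINE)


def fpr_devices(ip_a_output):
    """Return the set of fpr-* device names found in `ip a` output."""
    return set(FPR_LINE.findall(ip_a_output))


def routers_missing_fpr(router_ids, ip_a_output):
    """Return router IDs that have no fpr device in the given output."""
    present = fpr_devices(ip_a_output)
    # Assumption: device name is "fpr-" + first 10 chars of the router ID.
    return [rid for rid in router_ids if "fpr-" + rid[:10] not in present]


# Abbreviated copy of the `ip a` output after the failed update of r1.
AFTER_UPDATE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: fpr-25085e63-4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
71: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
"""

ROUTERS = [
    "25085e63-45a6-4795-93dc-77cb245664d7",  # r1
    "3805cd53-5fed-4fa3-9147-f396761fc9cd",  # r2
]

missing = routers_missing_fpr(ROUTERS, AFTER_UPDATE)
print(missing)  # ['3805cd53-5fed-4fa3-9147-f396761fc9cd']
```

On a real node the text would come from running `ip netns exec <fip-namespace> ip a`, as in the commands shown in this report.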


P.S.
I was investigating a customer issue in which they reported loss of internet connectivity through their routers. In short, the trigger was a bug I recently filed: https://bugs.launchpad.net/neutron/+bug/2077532, where the existence of two floatingip_agent_gateways ports led to an error in _update_gateway_port. Depending on the order in which the ports were returned, that error then caused the deletion of the veth pairs of all routers.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2077533

Title:
  An error in processing one DVR router can lead to connectivity issues
  for other routers

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2077533/+subscriptions