[Bug 2077533] [NEW] An error in processing one DVR router can lead to connectivity issues for other routers
Public bug reported:
I investigated a customer's issue and concluded that this code:
https://opendev.org/openstack/neutron/src/commit/0807c94dc9843fff318c21d1f6f7b8838f948f5f/neutron/agent/l3/dvr_fip_ns.py#L155-L160
which deletes the fip-namespace when an exception occurs during router processing, causes connectivity problems for other routers. Deleting the fip-namespace also removes the rfp/fpr veth pairs of every other router that uses it, but those other routers are never reprocessed. As a result, every router except the one that triggered the deletion is left without its rfp/fpr veth pair.
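The failure mode described above can be illustrated with a toy model. This is only a sketch: the class and function names are hypothetical stand-ins and are not Neutron's actual implementation. It assumes a single shared namespace object that holds one fpr device per router, and a retry that covers only the router whose processing failed.

```python
class ToyFipNamespace:
    """Hypothetical stand-in for the shared fip- namespace (not Neutron code)."""

    def __init__(self):
        self.fpr_devices = set()

    def delete(self):
        # Deleting a namespace destroys every device inside it,
        # including the fpr ends that belong to other routers.
        self.fpr_devices.clear()


def process_router(fip_ns, router_id, fail=False):
    """Plug the router's fpr device; on failure, delete the whole namespace."""
    try:
        if fail:
            raise RuntimeError("simulated gateway port update failure")
        fip_ns.fpr_devices.add("fpr-" + router_id[:10])
    except RuntimeError:
        fip_ns.delete()
        raise


fip_ns = ToyFipNamespace()
for rid in ("25085e63-45a6", "3805cd53-5fed"):
    process_router(fip_ns, rid)

try:
    process_router(fip_ns, "25085e63-45a6", fail=True)
except RuntimeError:
    # Only the router that hit the error is processed again.
    process_router(fip_ns, "25085e63-45a6")

print(sorted(fip_ns.fpr_devices))
```

After the retry, only the device of the router that failed is present; the second router's device is never restored because nothing schedules it for reprocessing.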
The issue might be difficult to trigger, so I'll demonstrate it with a
small hack:
--- a/neutron/agent/l3/dvr_fip_ns.py
+++ b/neutron/agent/l3/dvr_fip_ns.py
@@ -151,6 +151,11 @@ class FipNamespace(namespaces.Namespace):
         try:
             self._update_gateway_port(
                 agent_gateway_port, interface_name)
+            if getattr(self, 'test_fail', False):
+                self.test_fail = False
+                raise Exception('Test Fail')
+            else:
+                self.test_fail = True
         except Exception:
             # If an exception occurs at this point, then it is
             # good to clean up the namespace that has been created
1) I create two routers with the same external network:
[root@devstack0 ~]# openstack router create r1 --external-gateway public -c id
+-------+--------------------------------------+
| Field | Value |
+-------+--------------------------------------+
| id | 25085e63-45a6-4795-93dc-77cb245664d7 |
+-------+--------------------------------------+
[root@devstack0 ~]# openstack router create r2 --external-gateway public -c id
+-------+--------------------------------------+
| Field | Value |
+-------+--------------------------------------+
| id | 3805cd53-5fed-4fa3-9147-f396761fc9cd |
+-------+--------------------------------------+
[root@devstack0 ~]# ip netns exec fip-dad747c6-c234-41e3-ae27-c9602b81fbd2 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
<cut>
2: fpr-25085e63-4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 8e:e9:5e:65:9c:ad brd ff:ff:ff:ff:ff:ff link-netns qrouter-25085e63-45a6-4795-93dc-77cb245664d7
inet 169.254.120.3/31 scope global fpr-25085e63-4
valid_lft forever preferred_lft forever
inet6 fe80::8ce9:5eff:fe65:9cad/64 scope link
valid_lft forever preferred_lft forever
3: fpr-3805cd53-5@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 12:e1:bf:02:98:e0 brd ff:ff:ff:ff:ff:ff link-netns qrouter-3805cd53-5fed-4fa3-9147-f396761fc9cd
inet 169.254.77.247/31 scope global fpr-3805cd53-5
valid_lft forever preferred_lft forever
inet6 fe80::10e1:bfff:fe02:98e0/64 scope link
valid_lft forever preferred_lft forever
68: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether fa:16:3e:e4:ea:e1 brd ff:ff:ff:ff:ff:ff
inet 10.20.30.95/24 brd 10.20.30.255 scope global fg-21441edb-3f
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fee4:eae1/64 scope link
valid_lft forever preferred_lft forever
[root@devstack0 ~]#
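2) I trigger an update of router r1 that fails (see the hack above). The
failure leads to the deletion of the fip-namespace and to reprocessing of
r1 only. Router r2 loses its rfp/fpr veth pair and is therefore broken.
In the output below, fpr-3805cd53-5 is gone, and the index of the fg-
device has changed from 68 to 71 because the namespace was recreated.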
[root@devstack0 ~]# openstack router set r1 --name r1-updated
[root@devstack0 ~]# ip netns exec fip-dad747c6-c234-41e3-ae27-c9602b81fbd2 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: fpr-25085e63-4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether fa:68:ef:86:96:a5 brd ff:ff:ff:ff:ff:ff link-netns qrouter-25085e63-45a6-4795-93dc-77cb245664d7
inet 169.254.120.3/31 scope global fpr-25085e63-4
valid_lft forever preferred_lft forever
inet6 fe80::f868:efff:fe86:96a5/64 scope link
valid_lft forever preferred_lft forever
71: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether fa:16:3e:e4:ea:e1 brd ff:ff:ff:ff:ff:ff
inet 10.20.30.95/24 brd 10.20.30.255 scope global fg-21441edb-3f
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fee4:eae1/64 scope link
valid_lft forever preferred_lft forever
[root@devstack0 ~]#
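The comparison between the two listings can be automated. The following is a minimal sketch, not part of Neutron: it parses text in the format of `ip a` and reports routers whose fpr device is missing. The function names are hypothetical, and the device naming rule (`fpr-` followed by the first 10 characters of the router ID) is inferred from the output above rather than taken from the Neutron source.

```python
import re

# Matches interface header lines such as "2: fpr-25085e63-4@if3: <...>".
FPR_RE = re.compile(r"^\d+:\s+(fpr-[0-9a-f-]+)[@:]", re.MULTILINE)


def fpr_devices(ip_a_output):
    """Return the set of fpr- device names found in `ip a` output."""
    return set(FPR_RE.findall(ip_a_output))


def routers_missing_fpr(router_ids, ip_a_output):
    """Return the router IDs that have no fpr- device in the namespace."""
    present = fpr_devices(ip_a_output)
    # Assumption: device name is "fpr-" plus the first 10 chars of the ID.
    return [rid for rid in router_ids if "fpr-" + rid[:10] not in present]


sample = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
2: fpr-25085e63-4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
    inet 169.254.120.3/31 scope global fpr-25085e63-4
71: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
"""

routers = [
    "25085e63-45a6-4795-93dc-77cb245664d7",
    "3805cd53-5fed-4fa3-9147-f396761fc9cd",
]
missing = routers_missing_fpr(routers, sample)
print(missing)
```

On a real host, the text could come from running `ip netns exec <fip-namespace> ip a` and the router list from `openstack router list`; both are left out here so that the sketch stays self-contained.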
P.S.
I investigated a customer issue in which they reported loss of internet connectivity through their routers. In short, the trigger was a bug I recently reported: https://bugs.launchpad.net/neutron/+bug/2077532. There, the existence of two floatingip_agent_gateways ports led to an error in _update_gateway_port, which, depending on the order in which the ports were returned, caused the deletion of the veth pairs of all routers.
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2077533