yahoo-eng-team team mailing list archive

[Bug 1956846] [NEW] ha router duplicated routes

 

Public bug reported:

In our OpenStack Stein installation (neutron 14.4.2) we upgraded keepalived from 1.3.9 to 2.2.4.
After that, when restarting the neutron-l3-agent, we saw that the state of routers with external gateways could no longer be updated. We ended up with only standby routers, even though keepalived is working fine and we can see one keepalived instance in master state.

After debugging a bit we found the following traceback:
```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/usr/lib/python3/dist-packages/neutron/agent/l3/ha.py", line 166, in _enqueue_state_change
    ri.set_external_gw_port_link_status(link_up=True, set_gw=True)
  File "/usr/lib/python3/dist-packages/neutron/agent/l3/ha_router.py", line 547, in set_external_gw_port_link_status
    ns_name, preserve_ips)
  File "/usr/lib/python3/dist-packages/neutron/agent/l3/router_info.py", line 750, in _external_gateway_settings
    clean_connections=True)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 179, in init_router_port
    preserve_ips)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 203, in set_onlink_routes
    device.route.add_onlink_route(route)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 662, in add_onlink_route
    self.add_route(cidr, scope='link')
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 712, in add_route
    self._run_as_root_detect_device_not_found([ip_version], args)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 615, in _run_as_root_detect_device_not_found
    raise exceptions.DeviceNotFoundError(device_name=self.name)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 610, in _run_as_root_detect_device_not_found
    return self._as_root(options, tuple(args))
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 407, in _as_root
    use_root_namespace=use_root_namespace)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 121, in _as_root
    namespace=namespace)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 129, in _execute
    log_fail_as_error=self.log_fail_as_error)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py", line 147, in execute
    returncode=returncode)
neutron_lib.exceptions.ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: File exists
```

This traceback is triggered because the routers ended up with duplicated routes. Here is an example router:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```
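
To spot affected routers quickly, the check can be scripted. The following is a minimal sketch, not part of neutron: it assumes you feed it the text printed by `ip r` inside the router namespace, and the helper names `route_key` and `find_duplicate_routes` are made up for illustration. It groups route lines that become identical once the `proto <value>` pair is ignored.

```python
import re
from collections import defaultdict

# Matches a "proto <value>" token pair in `ip route` output.
_PROTO_RE = re.compile(r"\bproto\s+\S+\s*")


def route_key(line):
    """Return the route line with any 'proto <value>' pair removed."""
    return " ".join(_PROTO_RE.sub("", line).split())


def find_duplicate_routes(ip_route_output):
    """Group route lines that are identical once 'proto' is ignored.

    Hypothetical helper: returns {normalized_route: [original_lines]}
    for every route that appears more than once.
    """
    groups = defaultdict(list)
    for line in ip_route_output.splitlines():
        line = line.strip()
        if line:
            groups[route_key(line)].append(line)
    return {key: lines for key, lines in groups.items() if len(lines) > 1}
```

For the routing table above, this reports two groups: the default route and the `x.x.244.128/25` on-link route, each present once with `proto 18` and once without.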

At first I thought we had hit a keepalived bug, which I filed here:
https://github.com/acassen/keepalived/issues/2076


What I understood from the discussion with pqarmitage in that issue is that newer keepalived versions set `proto 18/keepalived` on the routes they create, and I think this breaks neutron.


So I assume the following is happening.
On a "fresh" router or after a failover, the qg- interface of the router is down, so keepalived is not able to set the virtual routes. neutron then brings up the qg- interface and creates the gateway routes through the set_external_gw_port_link_status function (https://opendev.org/openstack/neutron/src/tag/14.4.2/neutron/agent/l3/ha_router.py#L528). When I now restart the neutron-l3-agent, it reloads keepalived, which makes keepalived recreate the virtual_routes it could not create while the qg- interface was down. Because of the new behaviour, it creates the same routes but with `proto 18` added, and we end up with duplicated routes. After that, neutron fails on the `ip route replace` command with the `RTNETLINK answers: File exists` error.

Here is an example of a router getting into this state.
Backup router:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
    inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe56:3f59/64 scope link
       valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
1757: qg-6c2ee5e0-ad: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
```

After a failover to the backup node, where I assume neutron sets the gateway routes:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
    inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet 169.254.0.13/24 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe56:3f59/64 scope link
       valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/24 scope global qr-15d63a29-8e
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea9:cc49/64 scope link nodad
       valid_lft forever preferred_lft forever
1757: qg-6c2ee5e0-ad: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff
    inet x.x.244.116/26 scope global qg-6c2ee5e0-ad
       valid_lft forever preferred_lft forever
    inet6 x.x:1003::22b/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed2:6cc7/64 scope link nodad
       valid_lft forever preferred_lft forever


ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```

And then after a neutron-l3-agent restart, which triggers a keepalived reload:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

```
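
To make the effect of the reload explicit, the two `ip r` snapshots can be compared. This is only an illustrative sketch: `parse_routes` and `added_routes` are hypothetical helper names, and the inputs are assumed to be the raw route lines captured before and after the neutron-l3-agent restart.

```python
def parse_routes(ip_route_output):
    """Split `ip route` output into whitespace-normalized route lines."""
    return [
        " ".join(line.split())
        for line in ip_route_output.splitlines()
        if line.strip()
    ]


def added_routes(before, after):
    """Return the routes present in `after` but not in `before`.

    The comparison is on the full line, so a route that differs only
    by 'proto 18' counts as a new entry, matching what the kernel shows.
    """
    seen = set(parse_routes(before))
    return [route for route in parse_routes(after) if route not in seen]
```

Applied to the snapshot taken after the failover and the one taken after the restart, the only added entries are the two routes carrying `proto 18`, which fits the assumption that the keepalived reload creates them.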

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1956846

Title:
  ha router duplicated routes

Status in neutron:
  New

