
yahoo-eng-team team mailing list archive

[Bug 1859832] Re: L3 HA connectivity to GW port can be broken after reboot of backup node

 

Reviewed:  https://review.opendev.org/707406
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c52029c39aa824a67095fbbf9e59eff769d92587
Submitter: Zuul
Branch:    master

commit c52029c39aa824a67095fbbf9e59eff769d92587
Author: LIU Yulong <i@xxxxxxxxxxxx>
Date:   Thu Oct 31 19:06:37 2019 +0800

    Do not link up HA router gateway in backup node
    
    The L3 agent sets its router devices link up by default.
    For HA routers, the gateway device is plugged on
    all scheduled hosts. When the gateway device is
    up on a backup node, it will send out IPv6-related
    packets (MLDv2) depending on some kernel config.
    This makes the physical fabric think that the
    gateway MAC is now working on the backup node, and
    finally the master node's L3 traffic is broken.
    
    This patch sets the backup gateway device link down
    by default. When VRRP sets the master state on
    one host, the L3 agent state change procedure
    brings the gateway device link up.
    
    Closes-Bug: #1859832
    Change-Id: I8dca2c1a2f8cb467cfb44420f0eea54ca0932b05
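
To illustrate the idea of the fix (not the actual neutron code; the helper and device handling below are made up for the example, and neutron's own ip_lib wrappers are not used), a minimal Python sketch of "plug the HA gateway down on backup, bring it up on the VRRP master transition" could look like this:

    # Illustrative sketch only -- not the neutron patch itself.
    import subprocess

    def _set_link(namespace, device, state):
        # Run "ip link set <device> <up|down>" inside the router namespace.
        subprocess.check_call(
            ["ip", "netns", "exec", namespace,
             "ip", "link", "set", device, state])

    def plug_ha_gateway(namespace, device, is_master):
        # With the fix, a backup node keeps the gateway device down, so the
        # kernel never emits MLDv2 (or any other) packets from it.
        _set_link(namespace, device, "up" if is_master else "down")

    def on_ha_state_change(namespace, device, new_state):
        # Called when keepalived reports a VRRP transition for this router;
        # only the new master links the gateway device up.
        if new_state == "master":
            _set_link(namespace, device, "up")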


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1859832

Title:
  L3 HA connectivity to GW port can be broken after reboot of backup
  node

Status in neutron:
  Fix Released

Bug description:
  When a neutron router is in backup state on some network node (another network node is "active" for this router) and that node is rebooted, it may happen that connectivity to the router's gateway port is broken.
  It happens due to a race between the L3 agent and the OVS agent and is easier to reproduce when there are many routers in backup state on such a node.
  I was testing it with 10 routers, all in backup state. In that case 1 or 2 routers had broken connectivity after the reboot of the host.

  It is like that because when the L3 agent adds an interface to the router, it checks whether there is any IPv6 link-local address on the interface and, if there is, it flushes those IPv6 addresses and adds them to the keepalived config. That way keepalived can manage those IPs like any other IP address on the interface.
  The problem is that when an IPv6 address is removed from the interface, the kernel sends MLDv2 packets to unsubscribe from the multicast group. If those packets go out from the host, e.g. to a ToR switch, the switch learns that the MAC address of the gw port is on the wrong host (the rebooted one instead of the one where the router is in master state).
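
  As a rough illustration of the flush step that triggers this (the namespace and device names are placeholders, and this is plain "ip" usage rather than the L3 agent's actual code):

      # Sketch, assuming a qrouter namespace and a qg- gateway device.
      import subprocess

      def flush_ipv6_link_local(namespace, device):
          # Removing the fe80::/64 addresses makes the kernel leave the
          # corresponding multicast groups, which is what emits the MLDv2
          # packets described above.
          subprocess.check_call(
              ["ip", "netns", "exec", namespace,
               "ip", "-6", "addr", "flush", "dev", device, "scope", "link"])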

  Those MLDv2 packets aren't sent to the wire for every router, but only for some of them, due to the race.
  Basically, a new qg-XXX port is created in br-int by the L3 agent with DEAD_VLAN_TAG (4095), and then both agents, L3 and OVS, configure it. If the L3 agent flushes the IPv6 addresses from this interface BEFORE the OVS agent sets the proper tag (local_vlan_id) on the port, then all is fine because the MLDv2 packets are dropped. But if the L3 agent flushes them AFTER the tag is changed, then the MLDv2 packets are sent to the wire and break ingress connectivity.
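
  The race can be reasoned about by looking at the OVS tag of the qg- port at the moment of the flush. A small Python helper (the port name and function are examples, but "ovs-vsctl get Port <port> tag" is the standard command) shows the check:

      # If the port still carries the dead VLAN tag, MLDv2 leaving the port
      # is dropped by OVS; once the real local_vlan_id is set, it reaches
      # the wire and confuses the ToR switch.
      import subprocess

      DEAD_VLAN_TAG = 4095

      def port_still_dead(port_name):
          out = subprocess.check_output(
              ["ovs-vsctl", "get", "Port", port_name, "tag"])
          return out.decode().strip() == str(DEAD_VLAN_TAG)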

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1859832/+subscriptions

