yahoo-eng-team mailing list archive: Message #58196
[Bug 1637452] [NEW] HA router & L2 pop: link cannot recover when the tunnel NIC goes down and up
Public bug reported:
Description:
If the VXLAN NIC carrying an HA router's HA network goes down and comes back up,
the tunnel mesh can become inconsistent, and there is no quick way to recover.
ENV:
stable/mitaka (8.1.2 with backported L3 patches)
VXLAN
Some related settings:
l2_population = true
arp_responder = true
tunnel_types = vxlan
prevent_arp_spoofing = true
l3_ha = True
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2
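For reference, a sketch of where these options usually live on a Mitaka deployment (the file paths and section names below are the common defaults, not taken from this report):

    # [agent] section of the OVS agent config
    # (openvswitch_agent.ini or ml2_conf.ini, depending on packaging)
    [agent]
    tunnel_types = vxlan
    l2_population = true
    arp_responder = true
    prevent_arp_spoofing = true

    # [DEFAULT] section of neutron.conf on the neutron-server node
    [DEFAULT]
    l3_ha = True
    max_l3_agents_per_router = 2
    min_l3_agents_per_router = 2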
How to reproduce:
Assume a deployment with exactly:
two network nodes: N1 and N2
two compute nodes: C1 and C2
two HA routers: R1 and R2
two networks/subnets for VM: NET1 and NET2
Steps:
1. `ifdown` the VXLAN NICs on N1 and N2 that carry the HA network.
Phenomenon:
Both HA routers become active on both nodes;
that is, R1 and R2 each report the active state on both N1 and N2.
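For example, to simulate the failure and confirm the dual-active state (eth2 is an assumed interface name):

    # on both N1 and N2: take down the HA-network VXLAN NIC
    ifdown eth2

    # on the controller: both hosting agents now report ha_state "active"
    neutron l3-agent-list-hosting-router R1
    neutron l3-agent-list-hosting-router R2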
2. Create VM1 and VM2.
(You can use `nova boot --availability-zone <az>:<host>`
to place each VM on the compute node you need; a concrete example follows after this step.)
Then the topology is:
R1 -- NET1 -- VM1 (resides on C1)
R2 -- NET2 -- VM2 (resides on C2)
Phenomenon:
C1 and C2 each have only one tunnel, to a single network node:
C1 has a VXLAN tunnel to N1, and C2 has a VXLAN tunnel to N2.
(This is because the HA router's qr- port does not change its binding host.
Even though the HA router is active in multiple places, the `network:router_interface`
port can have only one host in ml2_port_bindings.)
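Hedged commands for this step (image, flavor, network IDs, and host names are placeholders):

    # place each VM on a specific compute node (admin-only scheduler hint)
    nova boot --image cirros --flavor m1.tiny --nic net-id=<NET1-id> \
        --availability-zone nova:C1 VM1
    nova boot --image cirros --flavor m1.tiny --nic net-id=<NET2-id> \
        --availability-zone nova:C2 VM2

    # confirm the router_interface port has a single binding host
    neutron router-port-list R1
    neutron port-show <qr-port-id> | grep binding:host_id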
3. `ifup` the VXLAN NICs on N1 and N2 again.
Phenomenon: each HA router goes back to one active and one standby instance.
Assume R1's master is now on N2
and R2's master is now on N1.
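To see which node holds each master, check the router namespaces on the network nodes (router UUIDs are placeholders); on the standby node, the qr-/qg- devices carry no IP addresses:

    # run on both N1 and N2; only the master's qr-/qg- devices show IPs
    ip netns exec qrouter-<R1-uuid> ip addr show
    ip netns exec qrouter-<R2-uuid> ip addr show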
Now the problem appears.
Recall the topology:
R1 -- NET1 -- VM1 (resides on C1)
R2 -- NET2 -- VM2 (resides on C2)
and the tunnels:
C1 has only one tunnel, to N1.
C2 has only one tunnel, to N2.
In other words:
C1 has no tunnel to N2, where R1's master now runs.
C2 has no tunnel to N1, where R2's master now runs.
As a result, TRAFFIC IS NOT REACHABLE.
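This is easy to verify on the compute nodes, since the OVS agent creates one vxlan port on br-tun per remote VTEP:

    # on C1: only the vxlan port toward N1 is listed
    ovs-vsctl list-ports br-tun
    # a healthy mesh would also show a vxlan-<hex-ip> port toward N2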
Restarting the compute nodes' OVS agents does not solve the issue:
with l2_population enabled, tunnels are created only in response to
FDB updates triggered by port events, and no new port is plugged. That means:
1. the tunnel from C1 to N2 cannot be created automatically;
2. the tunnel from C2 to N1 cannot be created automatically.
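You can confirm this by restarting the agent and re-checking br-tun (systemd and this service name are assumptions; names vary by distro):

    # on C1 (service name is distro-dependent)
    systemctl restart neutron-openvswitch-agent
    ovs-vsctl list-ports br-tun    # still no vxlan port toward N2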
One way to recover is to restart N1's neutron services and then N2's.
However, this may leave all HA routers' masters residing on one network node.
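A minimal sketch of that workaround, assuming systemd and the default agent service names:

    # on N1 first; repeat on N2 after N1's agents are back up
    systemctl restart neutron-openvswitch-agent neutron-l3-agent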
Production environment problem:
In production we saw unexplained link-down events, which the ifdown/ifup above simulates:
[ +4.999840] ixgbe 0000:03:00.0 eth2: NIC Link is Down
[ +1.926155] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
** Affects: neutron
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1637452