yahoo-eng-team mailing list archive: Message #58196
[Bug 1637452] [NEW] HA router & L2 pop: link cannot recover when the tunnel NIC goes down and up
Public bug reported:
Description:
If the VXLAN NIC carrying an HA router's HA network goes down and comes back up,
the tunnel mesh can become inconsistent, and there is no quick way to recover.
ENV:
stable/mitaka (8.1.2 with backported L3 patches)
VXLAN
Some related settings:
l2_population = true
arp_responder = true
tunnel_types = vxlan
prevent_arp_spoofing = true
l3_ha = True
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2
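For reference, a sketch of where these options usually live on a Mitaka deployment (the file paths and section names below are the common defaults, not taken from this report):

    # [agent] section of the OVS agent config
    # (openvswitch_agent.ini or ml2_conf.ini, depending on packaging)
    [agent]
    tunnel_types = vxlan
    l2_population = true
    arp_responder = true
    prevent_arp_spoofing = true

    # [DEFAULT] section of neutron.conf on the neutron-server node
    [DEFAULT]
    l3_ha = True
    max_l3_agents_per_router = 2
    min_l3_agents_per_router = 2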
How to reproduce:
Assume a deployment with exactly:
two network nodes: N1 and N2
two compute nodes: C1 and C2
two HA routers: R1 and R2
two networks/subnets for VM: NET1 and NET2
Steps:
1. `ifdown` the VXLAN NICs on N1 and N2 that carry the HA network.
Phenomenon:
Both HA routers become active on both nodes;
that is, R1 and R2 each report the active state on both N1 and N2.
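For example, to simulate the failure and confirm the dual-active state (eth2 is an assumed interface name):

    # on both N1 and N2: take down the HA-network VXLAN NIC
    ifdown eth2

    # on the controller: both hosting agents now report ha_state "active"
    neutron l3-agent-list-hosting-router R1
    neutron l3-agent-list-hosting-router R2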
2. Create VM1 and VM2.
(You can use `nova boot --availability-zone <az>:<host>`
to place each VM on the compute node you need; a concrete example follows after this step.)
Then the topology is:
R1 -- NET1 -- VM1 (resides on C1)
R2 -- NET2 -- VM2 (resides on C2)
Phenomenon:
C1 and C2 each have only one tunnel, to a single network node:
C1 has a VXLAN tunnel to N1, and C2 has a VXLAN tunnel to N2.
(This is because the HA router's qr- port does not change its binding host.
Even though the HA router is active in multiple places, the `network:router_interface`
port can have only one host in ml2_port_bindings.)
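Hedged commands for this step (image, flavor, network IDs, and host names are placeholders):

    # place each VM on a specific compute node (admin-only scheduler hint)
    nova boot --image cirros --flavor m1.tiny --nic net-id=<NET1-id> \
        --availability-zone nova:C1 VM1
    nova boot --image cirros --flavor m1.tiny --nic net-id=<NET2-id> \
        --availability-zone nova:C2 VM2

    # confirm the router_interface port has a single binding host
    neutron router-port-list R1
    neutron port-show <qr-port-id> | grep binding:host_id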
3. `ifup` the VXLAN NICs on N1 and N2 again.
Phenomenon: each HA router goes back to one active and one standby instance.
Assume R1's master is now on N2
and R2's master is now on N1.
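To see which node holds each master, check the router namespaces on the network nodes (router UUIDs are placeholders); on the standby node, the qr-/qg- devices carry no IP addresses:

    # run on both N1 and N2; only the master's qr-/qg- devices show IPs
    ip netns exec qrouter-<R1-uuid> ip addr show
    ip netns exec qrouter-<R2-uuid> ip addr show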
Now the problem appears.
Recall the topology:
R1 -- NET1 -- VM1 (resides on C1)
R2 -- NET2 -- VM2 (resides on C2)
and the tunnels:
C1 has only one tunnel, to N1.
C2 has only one tunnel, to N2.
In other words:
C1 has no tunnel to N2, where R1's master now runs.
C2 has no tunnel to N1, where R2's master now runs.
As a result, TRAFFIC IS NOT REACHABLE.
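This is easy to verify on the compute nodes, since the OVS agent creates one vxlan port on br-tun per remote VTEP:

    # on C1: only the vxlan port toward N1 is listed
    ovs-vsctl list-ports br-tun
    # a healthy mesh would also show a vxlan-<hex-ip> port toward N2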
Restarting the compute nodes' OVS agents does not solve the issue:
with l2_population enabled, tunnels are created only in response to
FDB updates triggered by port events, and no new port is plugged. That means:
1. the tunnel from C1 to N2 cannot be created automatically;
2. the tunnel from C2 to N1 cannot be created automatically.
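You can confirm this by restarting the agent and re-checking br-tun (systemd and this service name are assumptions; names vary by distro):

    # on C1 (service name is distro-dependent)
    systemctl restart neutron-openvswitch-agent
    ovs-vsctl list-ports br-tun    # still no vxlan port toward N2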
One way to recover is to restart N1's neutron services and then N2's.
However, this may leave all HA routers' masters residing on one network node.
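A minimal sketch of that workaround, assuming systemd and the default agent service names:

    # on N1 first; repeat on N2 after N1's agents are back up
    systemctl restart neutron-openvswitch-agent neutron-l3-agent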
Production environment problem:
In production we saw unexplained link-down events, which the ifdown/ifup above simulates:
[ +4.999840] ixgbe 0000:03:00.0 eth2: NIC Link is Down
[ +1.926155] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
** Affects: neutron
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1637452