yahoo-eng-team team mailing list archive

[Bug 1636466] [NEW] HA router interface points to wrong host after network disruption

 

Public bug reported:

If the overlay network of a network node is down for a while, the slave node of an HA router can't receive VRRP packets, so it promotes itself to master. The L3 agent then updates the ha_state of its own binding to the router to active and updates the port bindings of the router interfaces to point to its own host.
After the network recovers, one of the two master nodes of the HA router is demoted to slave. If the demoted node is the previous slave node, the L3 agent updates the ha_state of its binding to standby but does not update the port bindings of the router interfaces back to the host of the original master node. Packets sent to the router then go to the slave node, because l2pop uses the incorrect port bindings.
Because both nodes are configured with the same keepalived priority (50), the probability of hitting this problem in a two-network-node deployment is 50%.
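
The sequence above can be sketched as a toy model. This is not Neutron code: the `Router` class, `report_state` and `run_scenario` are hypothetical names, and the model assumes host2 is the original master and host1 the slave that promotes itself. It only encodes the rule described in this report, that a transition to active rebinds the router ports while a transition to standby does not.

```python
# Toy model of the reported behaviour; names are illustrative, not Neutron APIs.
class Router:
    def __init__(self, hosts):
        self.ha_state = {h: "standby" for h in hosts}
        self.port_binding_host = None

    def report_state(self, host, state):
        self.ha_state[host] = state
        # Rule described in the report: only a transition to "active"
        # moves the port binding; "standby" leaves it untouched.
        if state == "active":
            self.port_binding_host = host


def run_scenario(demoted_host):
    """Return (active host, host the router ports are bound to)."""
    router = Router(["host1", "host2"])
    router.report_state("host2", "active")        # assumed original master
    router.report_state("host1", "active")        # slave stops seeing VRRP, promotes itself
    router.report_state(demoted_host, "standby")  # overlay recovers, one node is demoted
    active = [h for h, s in router.ha_state.items() if s == "active"]
    return active[0], router.port_binding_host


for demoted in ("host1", "host2"):
    active, bound = run_scenario(demoted)
    print(demoted, "demoted:", "stale" if active != bound else "consistent")
```

Of the two possible outcomes, only the one where the previous slave (host1 here) is demoted leaves the binding on a standby host.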

How to reproduce:
- Set up two network nodes: host1 and host2.
- Create an HA router (router1), a network (network1) and a subnet (subnet1), then add an interface on subnet1 to router1.
- Disconnect host1 from the overlay network and wait until the l3-agent-list-hosting-router API shows ha_state active for both agents hosting router1.
- Restore the overlay network of host1 and wait until one ha_state of router1 turns to standby. There is a 50% probability that the port binding of the router1 interface is now inconsistent with the host hosting the active node. In that case, instances in subnet1 can't reach the router interface.
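
The inconsistency in the last step can be checked with a small helper like the one below. It is only a sketch: `find_stale_bindings` is a hypothetical function, and it assumes the caller has already fetched the agent list (dicts with `host` and `ha_state`, as shown by l3-agent-list-hosting-router) and the router interface ports (dicts with `id` and `binding:host_id`) and passes only the router's interface ports. The sample data is made up.

```python
def find_stale_bindings(agents, ports):
    """Return ids of ports that are not bound to the single active host."""
    active_hosts = [a["host"] for a in agents if a.get("ha_state") == "active"]
    if len(active_hosts) != 1:
        # Zero or two active agents: the router is still failing over,
        # so the binding cannot be judged yet.
        raise ValueError("expected exactly one active agent, got %d"
                         % len(active_hosts))
    active = active_hosts[0]
    return [p["id"] for p in ports if p.get("binding:host_id") != active]


# Made-up data matching the failure case: host2 is active again,
# but the interface is still bound to host1.
agents = [{"host": "host1", "ha_state": "standby"},
          {"host": "host2", "ha_state": "active"}]
ports = [{"id": "port-a", "binding:host_id": "host1"}]
print(find_stale_bindings(agents, ports))
```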

Expected behavior:
- Updating the ha_state of an HA router to standby should trigger an update of the port bindings of the router interfaces to the host whose ha_state is active.
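
One way to picture the expected behavior is the sketch below. It is not the actual Neutron change: `on_ha_state_change` and its plain-dict arguments are hypothetical, chosen only to show a standby transition rebinding the ports to the remaining active host.

```python
def on_ha_state_change(ha_states, bindings, host, new_state):
    """Record a state change and rebind router ports if needed.

    ha_states: dict mapping host -> "active" or "standby" (mutated in place)
    bindings:  dict mapping port id -> bound host (mutated in place)
    """
    ha_states[host] = new_state
    if new_state == "active":
        target = host
    else:
        # On a move to standby, look for the one host that is still active.
        others = [h for h, s in ha_states.items()
                  if s == "active" and h != host]
        if len(others) != 1:
            return  # no unambiguous active host; leave bindings alone
        target = others[0]
    for port_id in bindings:
        bindings[port_id] = target


# Split-brain state after the disruption: both active, ports bound to host1.
states = {"host1": "active", "host2": "active"}
bindings = {"port-a": "host1"}
on_ha_state_change(states, bindings, "host1", "standby")
print(bindings)
```

With this rule the binding follows the active host no matter which of the two nodes is demoted.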

Affected versions:
Reproducible on the master branch; Mitaka and Newton are presumably also affected.

** Affects: neutron
     Importance: Undecided
     Assignee: Quan Tian (tianquan23)
         Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1636466

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1636466/+subscriptions

