yahoo-eng-team team mailing list archive

[Bug 2023993] [NEW] OVN: Removal of chassis results in unbalanced distribution of LRPs

 

Public bug reported:

Consider the following gateway chassis priorities for router r_a:

Router  Priority    Chassis
r_a     5           gtw06
r_a     4           gtw05
r_a     3           gtw04
r_a     2           gtw03
r_a     1           gtw02

 
Note that r_a has no priority on gtw01. If we now stop gtw06 (using ovn-appctl exit) for maintenance, the situation becomes:

Router  Priority    Chassis
r_a     5           gtw05
r_a     4           gtw04
r_a     3           gtw03
r_a     2           gtw02
r_a     1           gtw01
 

So basically Neutron slides the priorities for that router over to the
remaining chassis when it detects that the chassis (gtw06) is down. I
believe it does that to avoid moving the active LRP more than once: the
router has already failed over to priority 4 (gtw05), so when gtw06 goes
down Neutron only promotes gtw05 to priority 5, and similarly bumps each
of the lower priorities by one.
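
To make the observed behaviour concrete, here is a minimal Python sketch
that reproduces the two tables above. It is only a model of what I see
from the outside, not Neutron's actual scheduler code; the function name
slide_priorities, the all_chassis argument, and the rule for picking the
chassis that takes the freed lowest priority are assumptions made for
illustration.

```python
def slide_priorities(bindings, failed, all_chassis):
    """Model the observed priority shift when one gateway chassis goes away.

    bindings:    dict mapping chassis name -> priority for a single router
    failed:      name of the chassis that was stopped
    all_chassis: every gateway chassis known to the deployment (assumption)
    """
    if failed not in bindings:
        # The router has no priority on the failed chassis: nothing changes.
        return dict(bindings)

    top = max(bindings.values())
    # Surviving chassis keep their relative order, highest priority first.
    survivors = sorted(
        (c for c in bindings if c != failed),
        key=lambda c: bindings[c],
        reverse=True,
    )
    # Re-number the survivors from the top priority downwards.
    shifted = {c: top - i for i, c in enumerate(survivors)}

    # Assumption: the freed lowest priority goes to the first chassis that
    # had no priority for this router before (gtw01 in the example).
    spare = next((c for c in all_chassis if c not in bindings), None)
    if spare is not None:
        shifted[spare] = top - len(survivors)
    return shifted


before = {"gtw06": 5, "gtw05": 4, "gtw04": 3, "gtw03": 2, "gtw02": 1}
gateways = ["gtw01", "gtw02", "gtw03", "gtw04", "gtw05", "gtw06"]
after = slide_priorities(before, "gtw06", gateways)
print(after)
```

In this model the active LRP moves exactly once (gtw06 to gtw05), which
matches the motivation guessed at above.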

The issue this causes shows up when gtw06 hosts many priority 5
routers: the rescheduling triggered by the failover of the chassis does
not result in a balanced distribution of the routers. To resolve that,
we currently have to run a separate external script that rebalances the
LRPs.
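
The imbalance can be illustrated with a small Python sketch that counts
how many routers are active on each chassis before and after gtw06 is
stopped. Only r_a is taken from this report; routers r_b to r_e and
their priorities are made-up sample data, and the helper names
active_chassis and active_counts are hypothetical. The sketch assumes
the active LRP sits on the live chassis with the highest priority.

```python
from collections import Counter


def active_chassis(bindings, down=()):
    """Return the live chassis with the highest priority, or None."""
    live = {c: p for c, p in bindings.items() if c not in down}
    return max(live, key=live.get) if live else None


def active_counts(routers, down=()):
    """Count how many routers are active on each chassis."""
    actives = (active_chassis(b, down) for b in routers.values())
    return Counter(a for a in actives if a is not None)


# r_a comes from the report; r_b..r_e are hypothetical sample routers.
routers = {
    "r_a": {"gtw06": 5, "gtw05": 4, "gtw04": 3, "gtw03": 2, "gtw02": 1},
    "r_b": {"gtw06": 5, "gtw05": 4, "gtw03": 3, "gtw02": 2, "gtw01": 1},
    "r_c": {"gtw06": 5, "gtw05": 4, "gtw04": 3, "gtw02": 2, "gtw01": 1},
    "r_d": {"gtw05": 5, "gtw04": 4, "gtw03": 3, "gtw02": 2, "gtw01": 1},
    "r_e": {"gtw04": 5, "gtw03": 4, "gtw02": 3, "gtw01": 2, "gtw06": 1},
}

before = active_counts(routers)
after = active_counts(routers, down={"gtw06"})
print("before:", dict(before))
print("after: ", dict(after))
```

With this sample data, every router that was active on gtw06 lands on
the same priority 4 chassis, and sliding the priorities afterwards keeps
it there, so one chassis ends up carrying most of the active LRPs.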

I am not yet sure whether this is by design, so that the operator has to
make sure the routers are rebalanced manually, or whether there is a
better solution that rebalances the LRPs while keeping the number of LRP
failovers to a minimum.

Neutron version: Yoga

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2023993

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2023993/+subscriptions