yahoo-eng-team team mailing list archive

[Bug 1731595] Re: L3 HA: multiple agents are active at the same time

 

Reviewed:  https://review.openstack.org/522641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9ed693228f90251c0f03fb842ef19628b439f9bc
Submitter: Zuul
Branch:    master

commit 9ed693228f90251c0f03fb842ef19628b439f9bc
Author: venkata anil <anilvenkata@xxxxxxxxxx>
Date:   Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start
    
    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.
    
    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.
    
    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.
    
    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
    
    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
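
The effect of the change described in this commit message can be illustrated with a small stand-alone model. This is only a sketch, not the neutron implementation: `update_all_ha_network_port_statuses` and `fetch_and_sync_all_routers` are the names used in the commit message, while `FakeServer`, `L3AgentSketch`, `on_agent_revived`, `count_port_resets`, the `reset_on_every_sync` flag and the host name are hypothetical and exist only for illustration.

```python
class FakeServer:
    """Hypothetical stand-in for the neutron server side; it only counts calls."""

    def __init__(self):
        self.port_resets = 0

    def update_all_ha_network_port_statuses(self, host):
        # In the scenario from the commit message, this is the call that sets
        # the HA network ports to DOWN and triggers rewiring by the ovs agent.
        self.port_resets += 1

    def get_routers(self, host):
        return []


class L3AgentSketch:
    """Hypothetical model of the agent; reset_on_every_sync=True is the old behaviour."""

    def __init__(self, server, host, reset_on_every_sync):
        self.server = server
        self.host = host
        self.reset_on_every_sync = reset_on_every_sync

    def start(self):
        if not self.reset_on_every_sync:
            # Behaviour after the fix: reset port statuses once, on agent start.
            self.server.update_all_ha_network_port_statuses(self.host)
        self.fetch_and_sync_all_routers()

    def fetch_and_sync_all_routers(self):
        if self.reset_on_every_sync:
            # Behaviour before the fix: reset port statuses on every full sync.
            self.server.update_all_ha_network_port_statuses(self.host)
        return self.server.get_routers(self.host)

    def on_agent_revived(self):
        # AGENT_REVIVED triggers a full sync, as described in the commit message.
        self.fetch_and_sync_all_routers()


def count_port_resets(reset_on_every_sync, revivals):
    """Start one agent, simulate `revivals` AGENT_REVIVED events, count resets."""
    server = FakeServer()
    agent = L3AgentSketch(server, "gateway-1", reset_on_every_sync)
    agent.start()
    for _ in range(revivals):
        agent.on_agent_revived()
    return server.port_resets
```

In this model, every missed state report costs another port reset under the old behaviour, whereas with the reset moved to agent start the count stays at one regardless of how often the agent is revived.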


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1731595

Title:
  L3 HA: multiple agents are active at the same time

Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in Ubuntu Cloud Archive pike series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Triaged
Status in neutron source package in Zesty:
  Triaged
Status in neutron source package in Artful:
  Fix Committed
Status in neutron source package in Bionic:
  Triaged

Bug description:
  OS: Xenial, Ocata from Ubuntu Cloud Archive
  We have three neutron-gateway hosts, with L3 HA enabled and a min of 2, max of 3.  There are approx. 400 routers defined.

  At some point (we weren't monitoring exactly) a number of the routers
  changed from being one active, and 1+ others standby, to >1 active.
  This included each of the 'active' namespaces having the same IP
  addresses allocated, and therefore traffic problems reaching
  instances.

  Removing the routers from all but one agent, and re-adding, resolved
  the issue.  Restarting one l3 agent also appeared to resolve the
  issue, but very slowly, to the point where we needed the system alive
  again faster and reverted to removing/re-adding.

  At the same time, a number of routers were listed without any agents
  active at all.  This situation appears to have been resolved by adding
  routers to agents, after several minutes downtime.
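
The two symptoms described above (routers with more than one active agent, and routers with no active agent) could be checked for with a short script along these lines. It is a sketch under stated assumptions: the function name `classify_ha_routers`, the `(router_id, agent_host, ha_state)` tuple layout and the `"active"`/`"standby"` state strings are hypothetical, and the input is assumed to have been collected separately from the deployment.

```python
from collections import Counter


def classify_ha_routers(router_ids, bindings):
    """Return (routers with >1 active agent, routers with no active agent).

    router_ids: iterable of all router IDs to check.
    bindings:   iterable of (router_id, agent_host, ha_state) tuples
                (hypothetical layout, gathered by other means).
    """
    active = Counter()
    for router_id, _agent_host, ha_state in bindings:
        if ha_state == "active":
            active[router_id] += 1

    multiple_active = sorted(r for r in router_ids if active[r] > 1)
    no_active = sorted(r for r in router_ids if active[r] == 0)
    return multiple_active, no_active


# Example with made-up data: r1 has two active agents, r3 has none.
routers = ["r1", "r2", "r3"]
bindings = [
    ("r1", "gw1", "active"),
    ("r1", "gw2", "active"),
    ("r2", "gw1", "active"),
    ("r2", "gw2", "standby"),
]
multiple_active, no_active = classify_ha_routers(routers, bindings)
```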

  I'm finding it very difficult to find relevant keepalived messages to
  indicate what's going on, but what I do notice is that all the agents
  have equal priority and are configured as 'backup'.

  I am trying to figure out a way to reproduce this. It might be that
  we need a large number of routers configured on a small number of
  gateways.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1731595/+subscriptions

