
yahoo-eng-team team mailing list archive

[Bug 1522980] Re: L3 HA integration with l2pop assumes control plane is operational for fail over

 

Reviewed:  https://review.openstack.org/255237
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26d8702b9d7cc5a4293b97bc435fa85983be9f01
Submitter: Jenkins
Branch:    master

commit 26d8702b9d7cc5a4293b97bc435fa85983be9f01
Author: venkata anil <anilvenkata@xxxxxxxxxx>
Date:   Thu Aug 4 07:14:47 2016 +0000

    l2pop fdb flows for HA router ports
    
    This patch makes L3 HA failover independent of neutron components
    (during failover).
    
    All HA agents (active and backup) call update_device_up/down after wiring
    the ports. However, the l2pop driver is called only for the active agent,
    because the port binding in the DB reflects the active agent. l2pop then
    creates unicast and multicast flows for the active agent only.
    On failover, flows to the new active agent must be created. For this to
    happen, the database, messaging server, neutron-server and destination L3
    agent must all be running during failover. This creates two issues:
    1) When any of the above resources (e.g. neutron-server) is dead,
       flows between the new master and the other agents are not created and
       L3 HA failover does not work. In the same scenario, L3 HA failover
       works if l2pop is disabled.
    2) Packet loss during failover is higher, because the above neutron
       resources interact multiple times and take time to create the L2 flows.
    
    With this change, the plugin also notifies l2pop when
    update_device_up/down is called by backup agents. l2pop then creates
    flood flows to all HA agents (both active and slave). l2pop does not
    create a unicast flow for this port; instead, the unicast flow is created
    by the learning action of table 10 when keepalived sends a GARP after
    assigning the IP address to the master router's qr-xx port. Because the
    flood flows already exist and the unicast flow is added dynamically,
    L3 HA failover does not depend on l2pop.
    
    This solves two issues:
    1) With L3 HA + l2pop, failover works even if any of the above agents
       or processes is dead.
    2) Failover time is reduced, as we no longer depend on neutron to create
       flows during failover.
    We use the L3HARouterAgentPortBinding table to get all HA agents of a
    router port. An HA router port on a slave agent is also counted in l2pop's
    distributed_active_network_ports and agent_network_active_port_count.
    
    Closes-bug: #1522980
    Closes-bug: #1602614
    Change-Id: Ie1f5289390b3ff3f7f3ed7ffc8f6a8258ee8662e
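
The flow behaviour described in the commit message above can be pictured with a minimal Python sketch. This is not the Neutron l2pop code: the function name, its parameters and the shape of the returned dictionary are assumptions made purely for illustration. It shows only the decision the commit describes: an HA router port gets a flood entry toward every agent hosting a replica and no unicast entry, while an ordinary port gets both entries toward its single bound agent.

```python
# Hypothetical sketch; names and data shapes are illustrative, not Neutron's API.

# Placeholder entry meaning "flood broadcast/unknown traffic to this tunnel".
FLOODING_ENTRY = ("00:00:00:00:00:00", "0.0.0.0")


def fdb_entries_for_port(port_mac, port_ip, agent_tunnel_ips, is_ha_router_port):
    """Return {tunnel_ip: [(mac, ip), ...]} describing the entries for one port.

    agent_tunnel_ips: tunnel endpoints of the agents hosting the port.
    For an HA router port this is every HA agent (active and backup);
    for an ordinary port it is the single bound agent.
    """
    entries = {}
    for tunnel_ip in agent_tunnel_ips:
        port_entries = [FLOODING_ENTRY]
        if not is_ha_router_port:
            # Ordinary ports also get a unicast entry, so no learning is needed.
            port_entries.append((port_mac, port_ip))
        # HA router ports get no unicast entry: the MAC is learned on the
        # data plane when the active router sends its GARP.
        entries[tunnel_ip] = port_entries
    return entries


ha = fdb_entries_for_port(
    "fa:16:3e:00:00:01", "192.168.1.1", ["10.0.0.1", "10.0.0.2"], True)
normal = fdb_entries_for_port(
    "fa:16:3e:00:00:02", "192.168.1.5", ["10.0.0.3"], False)
```

In this sketch a failover changes nothing in the entries for the HA port, which is why no control-plane component has to be reachable at that moment.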


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1522980

Title:
  L3 HA integration with l2pop assumes control plane is operational for
  fail over

Status in neutron:
  Fix Released

Bug description:
  Note: This is a soft requirement for DVR + L3 HA.

  L3 HA did not work with l2pop at all, and that was fixed here:
  https://bugs.launchpad.net/neutron/+bug/1365476 via https://review.openstack.org/#/c/141114/.

  However, the solution is suboptimal because it assumes the control plane is operational for fail over to work correctly.
  Without l2pop, L3 HA can fail over successfully if the database, messaging server, neutron-server and destination L3 agent are dead. With l2pop, all four are needed. This is because, for fail over to work, the destination L3 agent must notice that a router has transitioned to master and notify neutron-server via RPC, at which point neutron-server updates the 'binding:host' value of all of the router's internal ports to point to the target node, and l2pop code is executed in order to update the L2 agents.

  Instead, I'd like fail over to rely solely on the data plane,
  regardless of whether l2pop is on or off. One such solution would be
  something similar to patch set 9 of the patch:
  https://review.openstack.org/#/c/141114/9//COMMIT_MSG. The idea is to
  tell l2pop to treat HA router ports as replicated ports (which they
  are), so that tunnel endpoints would be created against all nodes that
  host replicas of the router, and the destination MAC address of the
  port would not be learned via l2pop, but via the fallback regular MAC
  learning mechanism. This means that we lose some of the advantage of
  l2pop, but I think it is essential to correct operation of L3 HA.
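
To make the fallback MAC learning idea concrete, here is a small, self-contained Python simulation. It is a sketch under stated assumptions, not Open vSwitch or Neutron code: the `LearningTable` class, its method names and the addresses are hypothetical, and real learned flows also carry timeouts that are ignored here. It shows how traffic to the router MAC is flooded to all replica hosts until a frame from the active router (for example a GARP) is seen, and how a GARP from the new master redirects traffic without any control-plane involvement.

```python
# Hypothetical simulation of data-plane MAC learning; not OVS or Neutron code.


class LearningTable:
    """Maps a source MAC to the tunnel endpoint it was last seen on."""

    def __init__(self):
        self._mac_to_tunnel = {}

    def learn(self, src_mac, tunnel_ip):
        # Called for every frame arriving from a tunnel, e.g. a GARP sent
        # by the router that has just become master.
        self._mac_to_tunnel[src_mac] = tunnel_ip

    def output_for(self, dst_mac, flood_tunnels):
        # Unicast if the MAC has been learned; otherwise flood to every
        # tunnel endpoint that hosts a replica of the router.
        learned = self._mac_to_tunnel.get(dst_mac)
        if learned is not None:
            return [learned]
        return sorted(flood_tunnels)


router_mac = "fa:16:3e:00:00:01"
replica_hosts = {"10.0.0.1", "10.0.0.2"}  # all nodes hosting the HA router

table = LearningTable()
before_garp = table.output_for(router_mac, replica_hosts)

table.learn(router_mac, "10.0.0.1")   # GARP from the initial master
steady_state = table.output_for(router_mac, replica_hosts)

table.learn(router_mac, "10.0.0.2")   # GARP from the new master after failover
after_failover = table.output_for(router_mac, replica_hosts)
```

The trade-off mentioned above is visible in `before_garp`: until the MAC is learned, frames are sent to every replica host rather than to a single pre-populated destination.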

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1522980/+subscriptions

