yahoo-eng-team team mailing list archive

[Bug 1533454] Re: L3 agent unable to update HA router state after race between HA router creating and deleting

 

Reviewed:  https://review.openstack.org/265685
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=472d84d25cee0694500e583845718a4f377cc75c
Submitter: Jenkins
Branch:    master

commit 472d84d25cee0694500e583845718a4f377cc75c
Author: LIU Yulong <liuyulong@xxxxxxxx>
Date:   Mon Jan 11 12:02:55 2016 +0800

    Catch PortNotFound after HA router race condition
    
    When the neutron server deletes all the resources of an
    HA router, the L3 agents are not immediately aware of it,
    so a race can happen in a sequence like this:
    1. The neutron server deletes all resources of an HA router.
    2. The RPC fans out to L3 agent 1, on which the HA router
       was in master state.
    3. On L3 agent 2, the 'backup' router promotes itself to
       master and sends the neutron server an HA router state
       change notification.
    4. PortNotFound is raised while updating the router HA port status.
    
    How do steps 2 and 3 happen?
    Suppose L3 agent 2 hosts many more HA routers than L3 agent 1,
    or for any other reason receives or processes the delete RPC
    later than L3 agent 1. When L3 agent 1 removes the HA router's
    keepalived process, the backup router on L3 agent 2 soon detects
    this via VRRP. At that point the router delete RPC is still in
    the RouterUpdate queue, or the HA router deletion procedure is
    still in progress, so router_info still contains the router.
    L3 agent 2 therefore runs the state change procedure, that is,
    it notifies the neutron server to update the router state.
    
    This patch mainly deals with the race by catching the
    PortNotFound exception on the neutron-server side.
    
    Change-Id: I34d7347595bfceb8a70685672a6287e1a44ede6b
    Closes-Bug: #1533454
    Related-Bug: #1523780
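
For illustration only, here is a minimal sketch of the server-side handling the commit message describes: catch PortNotFound when the HA port has already been deleted, and skip the status update instead of failing. It does not reproduce the actual neutron patch. The PortNotFound class, FakePortStore and update_ha_port_states below are hypothetical stand-ins, and the master-to-ACTIVE / otherwise-DOWN mapping is an assumption.

```python
import logging

LOG = logging.getLogger(__name__)


class PortNotFound(Exception):
    """Stand-in for the server's PortNotFound exception (hypothetical)."""

    def __init__(self, port_id):
        super().__init__("Port %s could not be found." % port_id)
        self.port_id = port_id


class FakePortStore:
    """Hypothetical in-memory port table, used only for this sketch."""

    def __init__(self, ports):
        self._ports = dict(ports)

    def delete_port(self, port_id):
        self._ports.pop(port_id, None)

    def update_port_status(self, port_id, status):
        # Mirrors the failure mode in the bug: the port is already gone.
        if port_id not in self._ports:
            raise PortNotFound(port_id)
        self._ports[port_id] = status

    def status(self, port_id):
        return self._ports.get(port_id)


def update_ha_port_states(store, states):
    """Apply agent-reported HA states, skipping concurrently deleted ports.

    `states` maps an HA port id to 'master' or 'backup'.
    Returns the list of port ids that were skipped.
    """
    skipped = []
    for port_id, state in states.items():
        # Assumed mapping: a master router's HA port is ACTIVE, else DOWN.
        status = "ACTIVE" if state == "master" else "DOWN"
        try:
            store.update_port_status(port_id, status)
        except PortNotFound:
            # The router was deleted while the state change was in flight.
            LOG.debug("HA port %s was deleted concurrently; skipping", port_id)
            skipped.append(port_id)
    return skipped
```

With this shape, a late state change notification for a deleted router is logged and ignored, while updates for the remaining ports still go through.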


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1533454

Title:
  L3 agent unable to update HA router state after race between HA router
  creating and deleting

Status in neutron:
  Fix Released

Bug description:
  The router L3 HA binding process does not take into account the fact
  that the port it is binding to the agent can be concurrently deleted.

  Details:

  When the neutron server deletes all the resources of an
  HA router, the L3 agents are not immediately aware of it,
  so a race can happen in a sequence like this:
  1. The neutron server deletes all resources of an HA router.
  2. The RPC fans out to L3 agent 1, on which
     the HA router was in master state.
  3. On L3 agent 2, the 'backup' router promotes itself to master
     and sends the neutron server an HA router state change
     notification.
  4. PortNotFound is raised in the function that updates HA router states.
  (It seems the DB error no longer occurs.)

  How do steps 2 and 3 happen?
  Suppose L3 agent 2 hosts many more HA routers than L3 agent 1,
  or for any other reason receives or processes the delete RPC
  later than L3 agent 1. When L3 agent 1 removes the HA router's
  keepalived process, the backup router on L3 agent 2 soon detects
  this via VRRP. At that point the router delete RPC is still in
  the RouterUpdate queue, or the HA router deletion procedure is
  still in progress, so router_info still contains the router.
  L3 agent 2 therefore runs the state change procedure, that is,
  it notifies the neutron server to update the router state.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1533454/+subscriptions

