yahoo-eng-team team mailing list archive

[Bug 1533455] Re: Stale processes live after a fanout HA router delete RPC between L3 agents

 

Bug closed due to lack of activity; please feel free to reopen if
needed.

** Changed in: neutron
       Status: In Progress => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1533455

Title:
  Stale processes live after a fanout HA router delete RPC between L3
  agents

Status in neutron:
  Won't Fix

Bug description:
  Stale processes live after a fanout HA router delete RPC between L3
  agents:

  The race happens between L3 agents after a fanout HA router delete RPC. Race scenario (see the sketch after this list):
  1. HA router X is scheduled to L3 agent A and L3 agent B

  2. X on L3 agent A is in the master state

  3. A delete RPC for X is fanned out

  4. Agent A deletes all of X's HA attributes and processes, including
  keepalived

  5. (race) Agent B is not yet ready to process the delete RPC;
  assume there are many delete RPCs in the router update queue, or
  something else delays agent B's processing of the RPC.

  6. (race) X on agent B is in the backup state; it can no longer
  receive VRRP advertisements from X on agent A because of step 4, so X
  sets its state to master

  7. (race) enqueue_state_change is called for X on agent B

  8. (race) Agent B now processes the delete RPC

  9. (race) X is still in agent B's router_info, so the metadata proxy
  is spawned

  10. (race) Agent B runs the delete procedure for HA router X's
  gateway, floating IPs, etc.

  11. (race) Agent B removes X from router_info

  12. The metadata proxy for router X on agent B stays alive.
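
  A minimal sketch of this check-then-act race, using hypothetical
  stand-ins for the agent's state (the names below are illustrative,
  not the actual Neutron code):

  import threading
  import time

  router_info = {"X": object()}  # router X is still known to agent B
  spawned_proxies = []           # metadata proxies started by agent B

  def enqueue_state_change(router_id):
      # Backup -> master transition (steps 7 and 9 above).
      if router_id in router_info:           # check: X is still present...
          time.sleep(0.01)                   # ...the delete handler runs here
          spawned_proxies.append(router_id)  # act: spawn the metadata proxy

  def process_router_delete(router_id):
      # Delete-RPC handling (steps 8, 10 and 11 above);
      # gateway/floating-IP teardown omitted.
      router_info.pop(router_id, None)
      # the proxy spawned concurrently above is never cleaned up

  t1 = threading.Thread(target=enqueue_state_change, args=("X",))
  t2 = threading.Thread(target=process_router_delete, args=("X",))
  t1.start(); time.sleep(0.005); t2.start()
  t1.join(); t2.join()
  print("stale metadata proxies:", spawned_proxies)  # -> ['X']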

  If you run the Rally create_and_delete_routers scenario, you will
  find stale metadata-proxy processes left on the L3 agent side after
  the test.

  The only way to decide whether to spawn the metadata proxy is to
  look the router up in the agent's router_info dict, but
  enqueue_state_change and router-delete processing can run
  concurrently.
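
  One conceivable mitigation, sketched below under the assumption of a
  per-router lock (the lock and helper names are hypothetical, not
  Neutron's actual API): serialize the state-change path and the delete
  path, and re-check router_info after the lock is acquired.

  import threading
  from collections import defaultdict

  _router_locks = defaultdict(threading.Lock)  # hypothetical per-router locks
  router_info = {}

  def spawn_metadata_proxy(router_id):
      print("spawn proxy for", router_id)

  def destroy_metadata_proxy(router_id):
      print("destroy proxy for", router_id)  # idempotent teardown

  def enqueue_state_change(router_id):
      with _router_locks[router_id]:
          # Re-check under the lock: if the delete path already removed
          # the router, do not spawn anything.
          if router_id in router_info:
              spawn_metadata_proxy(router_id)

  def process_router_delete(router_id):
      with _router_locks[router_id]:
          destroy_metadata_proxy(router_id)
          router_info.pop(router_id, None)

  Whatever mechanism is chosen, the key point is that the membership
  check and the proxy spawn must be atomic with respect to the delete
  path.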

  Here are some statistics after running Rally create_and_delete_routers:

  yulong@network2:/opt/openstack/neutron$ ~/ha_resource_state.sh

  neutron-keepalived-state-change count:
  0
  neutron-ns-metadata-proxy count:
  2
  keepalived process count:
  0
  HA router master state count:
  0
  IP monitor count:
  9
  external pids:
  2
  -rwxr-xr-x 1 root root 5 Mar  7 17:21 /opt/openstack/data/neutron/external/pids/5a83fe00-37c9-45fa-b299-2a1c49ce4bcc.pid
  -rwxr-xr-x 1 root root 5 Mar  7 17:20 /opt/openstack/data/neutron/external/pids/d9e2bdd3-63ac-4302-bb06-2f66e0308292.pid
  HA interface ip:
  all metadata-proxy router id:
  d9e2bdd3-63ac-4302-bb06-2f66e0308292
  5a83fe00-37c9-45fa-b299-2a1c49ce4bcc
  all ovs ha ports:
  0
  all router namespace:
  0

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1533455/+subscriptions
