yahoo-eng-team team mailing list archive

[Bug 1533455] [NEW] Stale processes lives after a fanout deleting HA router RPC between L3 agents

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: LIU Yulong <yulong@xxxxxxxxxxxxx>
Date: Wed, 13 Jan 2016 03:18:26 -0000
Reply-to: Bug 1533455 <1533455@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

Stale processes stay alive after a fanout RPC deleting an HA router is
sent to the L3 agents:

The race happens between L3 agents after a fanout RPC that deletes an
HA router. Race scenario:
1. HA router X is scheduled to L3 agent A and L3 agent B

2. X on L3 agent A is in the master state

3. a delete RPC for X is fanned out

4. agent A deletes all of X's HA attributes and processes, including
keepalived

5. (race) agent B is not yet ready to process the delete RPC; assume
there are many delete RPCs in the router update queue, or anything
else that delays agent B's processing of the RPC.

6. (race) X on agent B is in the backup state; it can no longer receive
VRRP advertisements from X on agent A because of step 4, so X sets its
state to master

8. (race) enqueue_state_change is called for X on agent B

9. (race) agent B is finally able to process the delete RPC

10. (race) X is still in agent B's router_info, so the metadata-proxy
is spawned

11. (race) agent B runs the delete process for HA router X's gateway,
floating IPs, etc.

12. (race) agent B removes X from router_info

13. the metadata-proxy for router X on agent B stays alive.
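
The interleaving in steps 9 to 13 can be replayed deterministically with
a toy model. This is only a sketch: ToyAgent and its methods are
hypothetical names, not Neutron code, and splitting the delete into a
"stop processes" phase and a "forget router" phase is an assumption made
to show where the state change slips in.

```python
class ToyAgent:
    """Toy stand-in for agent B; none of these names are Neutron APIs."""

    def __init__(self):
        self.router_info = {}          # router_id -> HA state
        self.metadata_proxies = set()  # router_ids with a running proxy

    def add_router(self, router_id):
        self.router_info[router_id] = "backup"

    def delete_begin(self, router_id):
        # Assumed first phase of the delete: stop running processes.
        self.metadata_proxies.discard(router_id)

    def on_state_change(self, router_id, state):
        # The only guard is a lookup in router_info (step 10).
        if router_id in self.router_info:
            self.router_info[router_id] = state
            if state == "master":
                self.metadata_proxies.add(router_id)

    def delete_finish(self, router_id):
        # Steps 11-12: clean up the rest and forget the router.
        self.router_info.pop(router_id, None)


def run_race():
    agent_b = ToyAgent()
    agent_b.add_router("X")
    agent_b.delete_begin("X")               # step 9
    agent_b.on_state_change("X", "master")  # steps 8 and 10
    agent_b.delete_finish("X")              # steps 11-12
    return agent_b                          # step 13: proxy left behind


agent_b = run_race()
print("X" in agent_b.router_info, "X" in agent_b.metadata_proxies)
```

After run_race() the router is gone from router_info, but its entry in
metadata_proxies remains, which is the stale process from step 13.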

If you use rally to run create_and_delete_routers, you will find stale
metadata-proxy processes left on the L3 agent side after the rally
test.
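
One way to spot the leftovers after such a run is to compare the proxy
command lines with the routers that still exist. The helper below is a
rough sketch under two assumptions: each proxy's command line contains
the string "metadata-proxy" and the router's UUID somewhere in its
arguments, and the caller supplies both the command lines and the set of
live router IDs (in lowercase). find_stale_proxies is a hypothetical
name, not part of Neutron or rally.

```python
import re

# Matches a lowercase UUID anywhere in a command line.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def find_stale_proxies(cmdlines, live_router_ids):
    """Return the proxy command lines that mention no live router ID."""
    live = set(live_router_ids)
    stale = []
    for cmdline in cmdlines:
        if "metadata-proxy" not in cmdline:
            continue  # not a metadata proxy process
        ids = set(UUID_RE.findall(cmdline.lower()))
        # Stale if the line names at least one UUID and none is live.
        if ids and not (ids & live):
            stale.append(cmdline)
    return stale
```

The command lines could be collected with something like "ps -eo args"
on the L3 agent host, and the live router IDs from the Neutron API.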

The only check that decides whether to spawn the metadata-proxy is a
lookup of the router in the agent's router_info dict. But
enqueue_state_change and the processing of a router delete can run
concurrently.
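
One possible way to close the window, shown here only as a sketch and
not as the fix adopted in Neutron, is to make "check router_info and
spawn" and "stop processes and forget the router" mutually exclusive.
GuardedAgent and its methods are hypothetical names, and
threading.Lock stands in for whatever synchronization primitive the
agent would really use.

```python
import threading


class GuardedAgent:
    """Toy agent where state changes and deletes share one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.router_info = {}          # router_id -> HA state
        self.metadata_proxies = set()  # router_ids with a running proxy

    def add_router(self, router_id):
        with self._lock:
            self.router_info[router_id] = "backup"

    def on_state_change(self, router_id, state):
        # Lookup and spawn happen atomically with respect to delete.
        with self._lock:
            if router_id not in self.router_info:
                return False  # router already deleted, spawn nothing
            self.router_info[router_id] = state
            if state == "master":
                self.metadata_proxies.add(router_id)
            return True

    def delete_router(self, router_id):
        # Forget the router and stop its proxy in one critical section.
        with self._lock:
            self.router_info.pop(router_id, None)
            self.metadata_proxies.discard(router_id)
```

With this shape, a state change that arrives before the delete spawns a
proxy that the delete then stops, and one that arrives after the delete
finds no router and spawns nothing.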

** Affects: neutron
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1533455

Title:
Stale processes lives after a fanout deleting HA router RPC between L3
agents

Status in neutron:
New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1533455/+subscriptions

Follow ups

[Bug 1533455] Re: Stale processes lives after a fanout deleting HA router RPC between L3 agents
From: Rodolfo Alonso, 2022-11-30