yahoo-eng-team team mailing list archive

[Bug 1916022] [NEW] L3HA Race condition during startup of the agent may cause inconsistent routers' states

Public bug reported:

I observed this issue in Tobiko jobs, e.g.
https://5f31a0f7dc56e4b42a89-207bd119fd0c3b58e9c78074b243256d.ssl.cf2.rackcdn.com/776284/2/check/devstack-tobiko-gate-multinode/257fd87/tobiko_results_05_verify_resources_scenario.html

The problem is with HA routers. What happens is that when neutron-l3-agent and then keepalived are killed on the node which is the master, a new node becomes the master, but the VIP address isn't removed from the qrouter namespace on the killed node.
In other words, some other node becomes the new master because keepalived on the still-running nodes did its job.
When the stopped agent is started again, it first calls update_initial_state() https://github.com/openstack/neutron/blob/90309cf6e2f3ed5ae6d5f4cca3c5351c2ac67a13/neutron/agent/l3/ha_router.py#L159
which enqueues a state change event, possibly with the "primary" state (the stale state from before the agent and keepalived went down).
Immediately after that, the agent also spawns the state change monitor, and that monitor enqueues a state change event as well. This one may already carry the correct "backup" state, but because the "primary" state is already scheduled to be processed, the new event is dropped.
As a result, two nodes end up in the "primary" state.
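
To make the race easier to follow, here is a minimal, self-contained Python sketch. StateChangeQueue and its enqueue() method are hypothetical stand-ins, not Neutron's actual classes; the only behaviour taken from the report above is that a new state change event is dropped while one for the same router is still waiting to be processed:

class StateChangeQueue(object):
    def __init__(self):
        # router_id -> state that is queued but not yet processed
        self._pending = {}

    def enqueue(self, router_id, state):
        if router_id in self._pending:
            # An event for this router is already scheduled, so the new
            # (possibly correct) state is discarded.
            print("dropping '%s' for %s, '%s' is already queued"
                  % (state, router_id, self._pending[router_id]))
            return
        self._pending[router_id] = state
        print("queued '%s' for %s" % (state, router_id))


queue = StateChangeQueue()

# 1) update_initial_state() runs first and enqueues the stale state left
#    over from before the agent and keepalived were killed.
queue.enqueue("router-1", "primary")

# 2) The state change monitor starts right afterwards with the correct
#    "backup" state, but its event is dropped because "primary" is pending.
queue.enqueue("router-1", "backup")

# The restarted node therefore transitions to "primary" even though another
# node already took over, leaving two nodes in the "primary" state.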

I think that calling update_initial_state() isn't really needed, as the
state change monitor always handles notification of the initial state
just after the process starts.
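
For illustration only, a hedged sketch of that suggested direction. HaRouterSketch and all of its methods are simplified stand-ins (only the name update_initial_state() comes from the code linked above), so this is not the actual Neutron change, just the idea of relying solely on the monitor for the initial state:

class HaRouterSketch(object):
    def state_change_callback(self, router_id, state):
        print("%s -> %s" % (router_id, state))

    def update_initial_state(self, callback):
        # Reads whatever state was left behind before the shutdown; in the
        # scenario above that is the stale "primary".
        callback("router-1", "primary")

    def spawn_state_change_monitor(self):
        # The monitor reports the real state shortly after keepalived starts.
        self.state_change_callback("router-1", "backup")

    def initialize_current(self):
        # Current order: the stale event is enqueued first and wins the race.
        self.update_initial_state(self.state_change_callback)
        self.spawn_state_change_monitor()

    def initialize_proposed(self):
        # Suggested: rely only on the monitor for the initial state, so no
        # stale event can be queued ahead of the correct one.
        self.spawn_state_change_monitor()

HaRouterSketch().initialize_proposed()  # prints: router-1 -> backup

With initialize_proposed(), only the monitor's "backup" notification is ever enqueued, so the stale "primary" can no longer win the race.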

** Affects: neutron
     Importance: Low
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed


** Tags: l3-ha tobiko
