yahoo-eng-team team mailing list archive
Message #85194
[Bug 1916022] [NEW] L3HA Race condition during startup of the agent may cause inconsistent router's states
Public bug reported:
I observed this issue in Tobiko jobs, e.g.
https://5f31a0f7dc56e4b42a89-207bd119fd0c3b58e9c78074b243256d.ssl.cf2.rackcdn.com/776284/2/check/devstack-tobiko-gate-multinode/257fd87/tobiko_results_05_verify_resources_scenario.html
The problem is with HA routers. When the neutron-l3-agent, and then keepalived, is killed on the node which is currently the master, another node becomes the new master, as keepalived on the still-running nodes does its job, but the VIP address isn't removed from the qrouter namespace on the old node.
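To confirm that symptom it is enough to check, on every node hosting the router, whether the VIP is still configured inside the qrouter namespace. A minimal sketch of such a check (the router ID and VIP below are placeholders, not values from the failing job):

    # check_vip.py - hypothetical helper, run on each L3 agent node;
    # ROUTER_ID and VIP are placeholders, not taken from the failing job
    import subprocess

    ROUTER_ID = "<router-uuid>"
    VIP = "<vip-address>"

    def node_holds_vip(router_id, vip):
        # List all addresses configured inside the qrouter namespace.
        out = subprocess.run(
            ["ip", "netns", "exec", "qrouter-%s" % router_id,
             "ip", "-o", "addr", "show"],
            capture_output=True, text=True, check=True).stdout
        return vip in out

    print("this node holds the VIP:", node_holds_vip(ROUTER_ID, VIP))

In the scenario described above this reports the VIP as present on more than one node.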
When the stopped agent is started again, it first calls update_initial_state() (https://github.com/openstack/neutron/blob/90309cf6e2f3ed5ae6d5f4cca3c5351c2ac67a13/neutron/agent/l3/ha_router.py#L159), which enqueues a state change event, possibly with the "primary" state (the stale state from before the agent and keepalived went down).
Immediately after that, the agent also spawns the state change monitor, and that monitor enqueues a state change event as well. This one may already carry the correct "backup" state, but because a "primary" state change is already queued for processing, the new event is dropped.
Because of that we end up with 2 nodes in the "primary" state.
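The race is easier to see with a simplified model of the agent's state change queue. This is only a sketch of the "one pending update per router" behaviour described above, not the actual neutron code:

    # race_sketch.py - simplified model of the startup race, not neutron code
    class StateChangeQueue:
        def __init__(self):
            self._pending = {}

        def enqueue(self, router_id, state):
            # A later update for the same router is dropped while an
            # earlier one is still waiting to be processed.
            if router_id in self._pending:
                print("dropping %s, %s already queued"
                      % (state, self._pending[router_id]))
                return
            self._pending[router_id] = state

        def process(self, router_id):
            return self._pending.pop(router_id)

    q = StateChangeQueue()
    # 1. update_initial_state() runs first and reads the stale, pre-restart state.
    q.enqueue("router-1", "primary")
    # 2. The state change monitor reports the real state slightly later.
    q.enqueue("router-1", "backup")   # dropped
    print(q.process("router-1"))      # -> "primary": the restarted node keeps
                                      #    acting as primary even though another
                                      #    node already took over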
I think the call to update_initial_state() isn't really needed, as the state change monitor always handles notification of the initial state right after the process starts.
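If that's correct, the fix would boil down to not seeding the queue from update_initial_state() during initialization and relying on the monitor alone. A rough illustration of the idea, reusing the simplified queue model above (again not the actual neutron code, and not a patch):

    # fix_sketch.py - illustrative only, not the actual neutron code
    pending = {}   # router_id -> first queued state; later updates are dropped

    def enqueue(router_id, state):
        pending.setdefault(router_id, state)

    # With update_initial_state() no longer called during initialization, the
    # stale pre-restart "primary" state is never queued; the state change
    # monitor's notification is the first (and only) one, so the processed
    # state matches what keepalived actually decided:
    enqueue("router-1", "backup")     # from the state change monitor
    print(pending["router-1"])        # -> "backup"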
** Affects: neutron
Importance: Low
Assignee: Slawek Kaplonski (slaweq)
Status: Confirmed
** Tags: l3-ha tobiko
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1916022
Title:
L3HA Race condition during startup of the agent may cause inconsistent
router's states
Status in neutron:
Confirmed
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1916022/+subscriptions