yahoo-eng-team team mailing list archive

[Bug 1916022] [NEW] L3HA Race condition during startup of the agent may cause inconsistent routers' states

Public bug reported:

I observed this issue in Tobiko jobs, e.g.
https://5f31a0f7dc56e4b42a89-207bd119fd0c3b58e9c78074b243256d.ssl.cf2.rackcdn.com/776284/2/check/devstack-tobiko-gate-multinode/257fd87/tobiko_results_05_verify_resources_scenario.html

The problem is with HA routers. What happens is that when neutron-l3-agent and then keepalived are killed on the node which is the master, a new node becomes the master, but the VIP address isn't removed from the qrouter namespace on the killed node.
In other words, some other node becomes the new master because keepalived on the still-running nodes did its job.
When the stopped agent is started again, it first calls update_initial_state() https://github.com/openstack/neutron/blob/90309cf6e2f3ed5ae6d5f4cca3c5351c2ac67a13/neutron/agent/l3/ha_router.py#L159
which enqueues a state change event, possibly with the "primary" state (the stale state from before the agent and keepalived went down).
Immediately after that, the agent also spawns the state change monitor, and that monitor enqueues a state change event as well. This one may already carry the correct "backup" state, but because the "primary" state is already scheduled to be processed, the new event is dropped.
As a result, two nodes end up in the "primary" state.
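
To make the race easier to follow, here is a minimal, self-contained Python sketch. StateChangeQueue and its enqueue() method are hypothetical stand-ins, not Neutron's actual classes; the only behaviour taken from the report above is that a new state change event is dropped while one for the same router is still waiting to be processed:

class StateChangeQueue(object):
    def __init__(self):
        # router_id -> state that is queued but not yet processed
        self._pending = {}

    def enqueue(self, router_id, state):
        if router_id in self._pending:
            # An event for this router is already scheduled, so the new
            # (possibly correct) state is discarded.
            print("dropping '%s' for %s, '%s' is already queued"
                  % (state, router_id, self._pending[router_id]))
            return
        self._pending[router_id] = state
        print("queued '%s' for %s" % (state, router_id))


queue = StateChangeQueue()

# 1) update_initial_state() runs first and enqueues the stale state left
#    over from before the agent and keepalived were killed.
queue.enqueue("router-1", "primary")

# 2) The state change monitor starts right afterwards with the correct
#    "backup" state, but its event is dropped because "primary" is pending.
queue.enqueue("router-1", "backup")

# The restarted node therefore transitions to "primary" even though another
# node already took over, leaving two nodes in the "primary" state.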

I think that calling update_initial_state() isn't really needed, as the
state change monitor always handles notification of the initial state
just after the process starts.
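
For illustration only, a hedged sketch of that suggested direction. HaRouterSketch and all of its methods are simplified stand-ins (only the name update_initial_state() comes from the code linked above), so this is not the actual Neutron change, just the idea of relying solely on the monitor for the initial state:

class HaRouterSketch(object):
    def state_change_callback(self, router_id, state):
        print("%s -> %s" % (router_id, state))

    def update_initial_state(self, callback):
        # Reads whatever state was left behind before the shutdown; in the
        # scenario above that is the stale "primary".
        callback("router-1", "primary")

    def spawn_state_change_monitor(self):
        # The monitor reports the real state shortly after keepalived starts.
        self.state_change_callback("router-1", "backup")

    def initialize_current(self):
        # Current order: the stale event is enqueued first and wins the race.
        self.update_initial_state(self.state_change_callback)
        self.spawn_state_change_monitor()

    def initialize_proposed(self):
        # Suggested: rely only on the monitor for the initial state, so no
        # stale event can be queued ahead of the correct one.
        self.spawn_state_change_monitor()

HaRouterSketch().initialize_proposed()  # prints: router-1 -> backup

With initialize_proposed(), only the monitor's "backup" notification is ever enqueued, so the stale "primary" can no longer win the race.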

** Affects: neutron
     Importance: Low
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed


** Tags: l3-ha tobiko
