
yahoo-eng-team team mailing list archive

[Bug 2009043] [NEW] neutron-l3-agent restart some random ha routers get wrong state

 

Public bug reported:

For a couple of weeks now we have had a problem in our production
environment when restarting our l3-agent. (Our assumption is that this
might have something to do with our upgrade to Wallaby, as we never saw
this problem on prior releases.)

The l3-agent hosts around 300 HA routers, so restarting it takes a
couple of seconds, which causes its alive state to go down and therefore
all active routers hosted on that agent flip to standby. When the agent
has finished its startup it should set the correct active state for its
routers again, but it fails to do so for a random subset of them. It
does not log any exceptions or errors, so we started debugging the
problem in our lab environment, which has at most 10-20 routers.
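For anyone trying to reproduce this: we compared the neutron API's view with what keepalived had written on the node. A minimal sketch of how we dumped the on-disk HA state per router — this helper is our own, not part of neutron, and the ha_confs path and the "state" file name are what we see on Wallaby; verify them against your deployment:

```python
import os

# Hypothetical helper (not part of neutron): summarize the keepalived HA
# state the L3 agent keeps on disk for each router. The base directory
# and the "state" file name match what we observe on a Wallaby node
# under /var/lib/neutron/ha_confs -- check the paths in your deployment.
HA_CONFS = "/var/lib/neutron/ha_confs"

def ha_state_summary(base_dir=HA_CONFS):
    """Return {router_id: state} for every router with a state file."""
    summary = {}
    if not os.path.isdir(base_dir):
        return summary
    for router_id in sorted(os.listdir(base_dir)):
        state_file = os.path.join(base_dir, router_id, "state")
        if os.path.isfile(state_file):
            with open(state_file) as f:
                summary[router_id] = f.read().strip()
    return summary
```

Running this on the affected node right after the agent restart shows which routers keepalived considers master versus backup, which we could then compare against the states reported by the neutron API.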

To reproduce this, we stopped an l3-agent completely until its alive
state went down and the routers flipped to standby; after starting the
agent again, some routers, just as in production, did not get back into
the active state.

We dug quite deep into the code, and what we see for routers that are
not functioning correctly is that they only reach the
_process_added_router function [1] and never reach the
_process_updated_router function [2].

For all the other routers that work correctly, we see that they first
hit [1] and then, a couple of seconds later, go through [2], which sets
the correct state again.
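To confirm which of the two code paths each router takes, we wrapped both functions with a small logging decorator. This is purely our own debugging aid, not anything that exists in neutron; the decorator name and log format are made up:

```python
import functools
import logging

LOG = logging.getLogger(__name__)

def log_router_call(func):
    """Log each invocation together with the router id, so the
    added vs. updated processing can be correlated per router in
    the agent log."""
    @functools.wraps(func)
    def wrapper(self, router):
        LOG.info("%s called for router %s", func.__name__, router["id"])
        return func(self, router)
    return wrapper

# Applied by hand in neutron/agent/l3/agent.py for debugging, e.g.:
#   @log_router_call
#   def _process_added_router(self, router): ...
#   @log_router_call
#   def _process_updated_router(self, router): ...
```

With this in place, grepping the agent log per router id makes it obvious which routers only ever see the "added" call.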

What is quite confusing is that it affects different routers on each
stop/start sequence of the l3-agent, and restarting the agent sometimes
fixes it and sometimes does not.

At this point we are not sure how to debug this further, as we are not very experienced with how and where router update events originate.
Does anyone have an idea where this could be broken, or can someone point us in a direction for debugging this further?
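One thing that helped us see the update notifications at all: turning on debug logging for the agent makes the incoming RPC router updates visible in the agent log. A minimal config fragment using the standard oslo debug option (the file path may differ per deployment):

```ini
# /etc/neutron/l3_agent.ini (or neutron.conf, depending on deployment)
[DEFAULT]
debug = True
```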

Neutron is running on Wallaby (18.5.0).

Thanks in advance

[1] https://github.com/openstack/neutron/blob/18.5.0/neutron/agent/l3/agent.py#L631
[2] https://github.com/openstack/neutron/blob/18.5.0/neutron/agent/l3/agent.py#L633

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2009043

Title:
  neutron-l3-agent restart some random ha routers get wrong state

Status in neutron:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2009043/+subscriptions