yahoo-eng-team team mailing list archive

[Bug 1871850] [NEW] [L3] existing router resources are partially deleted unexpectedly when MQ is gone

 

Public bug reported:

ENV: we hit this issue on our stable/queens deployment, but the master
branch has the same code logic.

When the L3 agent gets a router update notification, it tries to
retrieve the router info from the DB server over RPC [1]. If the message
queue is down or unreachable at that time, the call fails with
message-queue-related exceptions, and a resync action is then run [2].
In my personal experience, a RabbitMQ cluster is sometimes not easy to
recover. If the MQ takes a long time to recover, the router info sync
RPC keeps failing until it hits the max retry count [3]. Then the bad
thing happens: the L3 agent tries to remove the router [4], which
basically shuts down all existing L3 traffic of this router.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L705
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L710
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L666
[4] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L671
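
To make the failure mode concrete, here is a minimal toy model of the flow
described above. It is only a sketch, not the neutron code: ToyL3Agent,
MessageQueueDown, handle_update and the MAX_RETRIES value are hypothetical
names chosen for illustration, and the real agent re-queues updates rather
than looping in place.

```python
# Toy model of the reported behaviour; NOT the actual neutron implementation.
# Every name below is hypothetical and exists only for illustration.


class MessageQueueDown(Exception):
    """Stands in for any messaging-layer error raised by the sync RPC."""


class ToyL3Agent:
    MAX_RETRIES = 5  # assumed value, not taken from neutron

    def __init__(self, fetch_router):
        self.fetch_router = fetch_router  # callable: router_id -> router dict
        self.routers = {}                 # routers currently configured locally

    def handle_update(self, router_id):
        tries_left = self.MAX_RETRIES
        while True:
            try:
                # Step [1]: fetch the router info from the server over RPC.
                self.routers[router_id] = self.fetch_router(router_id)
                return "updated"
            except MessageQueueDown:
                if tries_left == 0:
                    # Steps [3]/[4]: retry limit hit, so the router is removed
                    # locally even though its data plane was working fine.
                    self.routers.pop(router_id, None)
                    return "removed"
                # Step [2]: schedule another resync attempt.
                tries_left -= 1


def always_down(router_id):
    """Simulates a message queue that never recovers."""
    raise MessageQueueDown("broker unreachable")


agent = ToyL3Agent(always_down)
agent.routers["r1"] = {"id": "r1"}   # an existing, working router
result = agent.handle_update("r1")   # the MQ stays down for every retry
print(result, "r1" in agent.routers)
```

In this model the router is dropped purely because the server could not be
reached, not because the server said the router was deleted, which is the
behaviour this report is about.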

** Affects: neutron
     Importance: Critical
         Status: Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1871850

Title:
  [L3] existing router resources are partially deleted unexpectedly when
  MQ is gone

Status in neutron:
  Confirmed



