yahoo-eng-team team mailing list archive

[Bug 1871850] Re: [L3] existing router resources are partial deleted unexceptedly when MQ is gone

 

Reviewed:  https://review.opendev.org/719127
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=12b9149e20665d80c11f1ef3d2283e1fa6f3b693
Submitter: Zuul
Branch:    master

commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693
Author: LIU Yulong <i@xxxxxxxxxxxx>
Date:   Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable
    
    When the L3 agent gets a router update notification, it tries to
    retrieve the router info from the neutron server. If the message
    queue is down or unreachable at that moment, the call fails with
    message-queue-related exceptions, and a resync is scheduled. A
    RabbitMQ cluster is sometimes not easy to recover, and a long MQ
    recovery time means the router info sync RPC keeps failing until it
    reaches the maximum retry count. Then the bad thing happens: the L3
    agent tries to remove the router, which basically shuts down all the
    existing L3 traffic of this router.
    
    This patch removes that final router removal action and lets the
    router keep running as it is.
    
    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1871850

Title:
  [L3] existing router resources are partial deleted unexceptedly when
  MQ is gone

Status in neutron:
  Fix Released

Bug description:
  ENV: we hit this issue on our stable/queens deployment, but the master
  branch has the same code logic.

  When the L3 agent gets a router update notification, it tries to
  retrieve the router info from the DB server [1]. If the message queue
  is down or unreachable at that moment, the call fails with
  message-queue-related exceptions, and a resync action is scheduled
  [2]. In my personal experience, a RabbitMQ cluster is sometimes not
  easy to recover. A long MQ recovery time means the router info sync
  RPC keeps failing until it reaches the maximum retry count [3]. Then
  the bad thing happens: the L3 agent tries to remove the router [4],
  which basically shuts down all the existing L3 traffic of this router.

  [1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L705
  [2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L710
  [3] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L666
  [4] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L671
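
  The sequence described above can be illustrated with a minimal,
  self-contained Python sketch. It is not the neutron code: the class,
  the function names, the retry limit of 5, and the
  remove_on_give_up switch are all hypothetical and exist only to
  contrast "remove the router once retries are exhausted" with "leave
  the running router alone".

```python
class MessagingTimeout(Exception):
    """Stand-in for an RPC failure raised while the message queue is down."""


MAX_RETRIES = 5  # assumed limit, chosen only for this illustration


class SketchAgent:
    """Hypothetical, heavily simplified model of the agent's update handling."""

    def __init__(self, fetch_router, remove_on_give_up):
        self.fetch_router = fetch_router            # callable that performs the "RPC"
        self.remove_on_give_up = remove_on_give_up  # True models the reported behaviour
        self.routers = {"r1": {"id": "r1"}}         # routers currently running
        self.retries = {}                           # failed attempts per router

    def process_update(self, router_id):
        try:
            info = self.fetch_router(router_id)
        except MessagingTimeout:
            tries = self.retries.get(router_id, 0) + 1
            self.retries[router_id] = tries
            if tries < MAX_RETRIES:
                return "resync"                     # caller should try again later
            # Retry budget exhausted.
            self.retries.pop(router_id, None)
            if self.remove_on_give_up:
                # Reported behaviour: the running router is torn down.
                self.routers.pop(router_id, None)
                return "removed"
            # Behaviour after the fix: give up on the sync, keep the router.
            return "kept"
        self.retries.pop(router_id, None)
        self.routers[router_id] = info
        return "updated"


def mq_down(router_id):
    """Simulates an unreachable message queue: every fetch fails."""
    raise MessagingTimeout(router_id)


def run_until_give_up(agent, router_id):
    """Re-queues the update until the agent stops asking for a resync."""
    outcome = "resync"
    while outcome == "resync":
        outcome = agent.process_update(router_id)
    return outcome
```

  With mq_down as the fetch function, an agent built with
  remove_on_give_up=True ends up without router "r1", while one built
  with remove_on_give_up=False still has it after the retries run out.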

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1871850/+subscriptions

