yahoo-eng-team team mailing list archive

[Bug 1943725] [NEW] Automatic cleanup of BGP speakers is too aggressive

 

Public bug reported:

We have seen regular issues with the neutron-bgp-dragent service when
one or more network nodes fail or are undergoing maintenance.

In the most problematic case, we have a deployment with four network
nodes. Each of these runs a neutron-bgp-dragent process, and each is
associated with the same BGP speaker. When one of these network nodes
goes down, a short time later a cleanup process runs, but rather than
just removing the speaker association from the absent network node, it
removes the association from all but one of the nodes.
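
As an illustration of the expected behaviour, here is a minimal sketch.
It is not the neutron-dynamic-routing cleanup code: the function name,
the (speaker, agent) tuple layout and the host names are all
hypothetical. It only shows that, with four associations and one dead
agent, a single association should be selected for removal.

```python
def stale_bindings(bindings, alive_agents):
    """Return only the (speaker_id, agent_id) pairs whose agent is down.

    Hypothetical helper for illustration; not part of neutron.
    """
    return [(speaker, agent) for speaker, agent in bindings
            if agent not in alive_agents]


# One speaker associated with four network nodes (names are made up).
bindings = [("speaker-1", "net1"), ("speaker-1", "net2"),
            ("speaker-1", "net3"), ("speaker-1", "net4")]
alive = {"net1", "net2", "net3"}  # net4 is down

to_remove = stale_bindings(bindings, alive)
remaining = [b for b in bindings if b not in to_remove]
```

In this sketch only the net4 association is selected, and the other
three remain in place. The behaviour reported here is the opposite:
only one association survives.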

During this process, the running neutron-bgp-dragent processes report
errors such as the following (observed using the latest
neutron-dynamic-routing code from stable/victoria):

"Unable to sync BGP speaker state.: RuntimeError: dictionary changed
size during iteration"

or

Sep 15 13:36:26 neutron-bgp-dragent[1308396]: 2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server [req-094dad10-b4da-4c50-8e32-f7814d446705 - - - - -] Exception during message handling: TypeError: unhashable type: 'dict'
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py", line 185, in bgp_speaker_create_end
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     self.add_bgp_speaker_helper(bgp_speaker_id)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py", line 249, in add_bgp_speaker_helper
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     self.add_bgp_speaker_on_dragent(bgp_speaker)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py", line 344, in add_bgp_speaker_on_dragent
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     self.cache.put_bgp_speaker(bgp_speaker)
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py", line 582, in put_bgp_speaker
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     self.remove_bgp_speaker_by_id(self.cache[bgp_speaker['id']])
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server   File "/openstack/venvs/neutron-22.1.3/lib/python3.8/site-packages/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py", line 600, in remove_bgp_speaker_by_id
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server     if bgp_speaker_id in self.cache:
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server TypeError: unhashable type: 'dict'
2021-09-15 13:36:26.605 1308396 ERROR oslo_messaging.rpc.server
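
Both messages are standard Python exceptions and can be reproduced in
isolation. The sketch below is not the agent code. SpeakerCache is a
hypothetical stand-in that copies only the two statements visible in the
traceback (a cached entry passed to remove_bgp_speaker_by_id, followed
by an "in self.cache" test). The guard around that call and the shape of
the cached entry are assumptions.

```python
class SpeakerCache:
    """Hypothetical stand-in for the agent's speaker cache."""

    def __init__(self):
        self.cache = {}

    def put_bgp_speaker(self, bgp_speaker):
        # Assumed guard: only try to remove an entry that is already cached.
        if bgp_speaker['id'] in self.cache:
            # Same call shape as in the traceback: the cached entry (a dict)
            # is passed where a speaker id is expected.
            self.remove_bgp_speaker_by_id(self.cache[bgp_speaker['id']])
        self.cache[bgp_speaker['id']] = {'bgp_speaker': bgp_speaker}

    def remove_bgp_speaker_by_id(self, bgp_speaker_id):
        # A membership test hashes its operand, so a dict raises TypeError.
        if bgp_speaker_id in self.cache:
            del self.cache[bgp_speaker_id]


def reproduce_unhashable():
    cache = SpeakerCache()
    speaker = {'id': 'speaker-1'}
    cache.put_bgp_speaker(speaker)      # first put: nothing cached yet
    try:
        cache.put_bgp_speaker(speaker)  # second put: hits the removal path
    except TypeError as exc:
        return str(exc)
    return None


def reproduce_changed_size():
    speakers = {'speaker-1': {}, 'speaker-2': {}}
    try:
        for speaker_id in speakers:
            del speakers[speaker_id]    # mutating the dict being iterated
    except RuntimeError as exc:
        return str(exc)
    return None
```

In this sketch the TypeError only appears when a speaker that is already
cached is added again. That would be consistent with a create
notification arriving for a speaker the agent still holds, though this
report does not confirm that this is the trigger.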


This issue appears to match a comment here: https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/780675/3#message-c409a4fb83a44216e03a041921c7067f44eb70d0

We will test out
https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/780675
as in our case the automatic behaviour appears mostly unnecessary, but a
fix for the underlying issue would still be appreciated.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1943725

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1943725/+subscriptions