yahoo-eng-team team mailing list archive

[Bug 1755810] Re: concurrent calls to _bind_port_if_needed result in port stuck in DOWN status

 

Reviewed:  https://review.opendev.org/606827
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443
Submitter: Zuul
Branch:    master

commit 0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443
Author: Kailun Qin <kailun.qin@xxxxxxxxxxx>
Date:   Wed May 1 15:50:48 2019 +0800

    Populate binding levels when concurrent ops fail
    
    Concurrent calls to _bind_port_if_needed may lead to a missing RPC
    notification which can cause a port stuck in a DOWN state. If the only
    caller that succeeds in the concurrency does not specify that an RPC
    notification is allowed then no RPC would be sent to the agent. The
    other caller which needs to send an RPC notification will fail since the
    resulting PortContext instance will not have any binding levels set.
    
    The failure has negative effects on consumers of the L2Population
    functionality because the L2Population mechanism driver will not be
    triggered to publish that a port is UP on a given compute node. Manual
    intervention is required in this case.
    
    This patch proposes to handle this by populating the PortContext with
    the current binding levels so that the caller can continue on and have
    an RPC notification sent out.
    
    Closes-Bug: #1755810
    Story: 2003922
    Change-Id: Ie2b813b2bdf181fb3c24743dbd13487ace6ee76a
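
The patch, roughly: when _commit_port_binding loses the concurrency race,
the current binding levels are copied into the fresh PortContext before it
is handed back, so the caller can still notify. A minimal, runnable sketch
of that idea (every name here is a simplified stand-in, not the committed
neutron code):

    # Sketch only: simplified stand-ins, not the committed neutron code.

    class PortContext(object):
        def __init__(self, port_id, binding_levels=None):
            self.port_id = port_id
            self.binding_levels = binding_levels or []

    def get_binding_level_objs(port_id):
        # Stand-in for fetching the already-committed binding levels
        # from the database.
        return ["level-0"]

    def commit_port_binding(context, lost_race):
        # _commit_port_binding builds a fresh PortContext; before the
        # fix it carried no binding levels when a concurrent caller had
        # already bound the port.
        new_context = PortContext(context.port_id)
        if lost_race:
            # The fix: repopulate the levels so a notify=True caller
            # can still send its RPC instead of silently dropping it.
            new_context.binding_levels = get_binding_level_objs(
                new_context.port_id)
        return new_context

    ctx = commit_port_binding(PortContext("ea5e524e"), lost_race=True)
    print(ctx.binding_levels)  # ['level-0'] -> notification can proceed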


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1755810

Title:
  concurrent calls to _bind_port_if_needed result in port stuck in DOWN
  status

Status in neutron:
  Fix Released

Bug description:
  There is a concurrency issue in _bind_port_if_needed [1] that leads to
  a missing RPC notification, which in turn results in a port stuck in
  DOWN status following a live-migration.  If update_port [2] runs
  concurrently with any other API or RPC handler that needs to call
  _bind_port_if_needed, it is possible that the _bind_port_if_needed
  call originating from update_port will fail to send an RPC
  notification: the PortContext has no binding levels, which causes
  _notify_port_updated [3] to suppress the notification.
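
  To illustrate why the notification is suppressed, here is a minimal,
  runnable sketch of that check (simplified stand-ins, not the actual
  neutron code; only the shape of the logic follows the methods
  referenced below):

    class PortContext(object):
        def __init__(self, port, binding_levels=None):
            self.current = port
            self.binding_levels = binding_levels or []

        @property
        def bottom_bound_segment(self):
            # The bound segment is derived from the innermost binding
            # level; with no levels there is no segment.
            return self.binding_levels[-1] if self.binding_levels else None

    def notify_port_updated(context):
        if not context.bottom_bound_segment:
            # The silent drop: no binding levels -> no bound segment
            # -> no port update RPC to the L2 agent.
            print("notification suppressed for %s" % context.current["id"])
            return
        print("port update RPC sent for %s" % context.current["id"])

    notify_port_updated(PortContext({"id": "ea5e524e"}))             # suppressed
    notify_port_updated(PortContext({"id": "ea5e524e"}, ["seg-0"]))  # sent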

  For example, if get_device_details [4] runs concurrently with
  update_port, it calls get_bound_port_context [5], which calls
  _bind_port_if_needed(notify=False), while update_port calls
  _bind_port_if_needed(notify=True).  If the call made by update_port
  commits the port binding first there is no issue, but if the call made
  by get_device_details finishes first then no RPC notification is sent
  to the agent.  Without that notification the port remains stuck in the
  DOWN status until another port update forces the agent to act.  If
  this happens early in the live-migration the system may auto-correct
  when a subsequent port update arrives, but if it happens on the last
  port update of the live-migration then the port remains stuck.
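
  A toy reenactment of the losing interleaving (the single "bound" flag
  and all names are stand-ins, not neutron code):

    port_bound = {"done": False}

    def bind_port_if_needed(caller, notify):
        if port_bound["done"]:
            # Lost the race: the retry sees the port already bound and
            # ends up with a PortContext without binding levels, so
            # even notify=True cannot produce an RPC notification.
            print("%s: lost race, notify=%s dropped" % (caller, notify))
            return
        port_bound["done"] = True
        if notify:
            print("%s: bound the port, RPC notification sent" % caller)
        else:
            print("%s: bound the port, notify=False, no RPC" % caller)

    # get_device_details commits first with notify=False: no RPC is
    # ever sent, and the port stays DOWN.
    bind_port_if_needed("get_device_details", notify=False)
    bind_port_if_needed("update_port", notify=True)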

  A port stuck in the DOWN status has negative effects on consumers of
  the L2Population functionality because the L2Population mechanism
  driver will not be triggered to publish that a port is UP on a given
  compute node.

  The issue coincides with the occurrence of this log:

     2018-03-14 11:16:00.429 19987 INFO neutron.plugins.ml2.plugin
     [req-ef7e51e2-ef99-48f9-bc6f-45684c0bbce4
     b9004dd32a07409787d2cf58f30b5fb8 2c45a0a106574a56bff11c3e83c331a6
     - default default] Attempt 2 to bind port
     ea5e524e-e7d4-4fec-a491-11f80f1de4a7

  On the first iteration through _bind_port_if_needed, the context
  returned by _attempt_binding [6] has proper binding levels set on the
  PortContext, but the subsequent call to _commit_port_binding [7]
  replaces the PortContext with a new instance which does not have any
  binding levels set.  That new PortContext is returned and used within
  _bind_port_if_needed during the second iteration.  During that second
  iteration the call to _attempt_binding returns without doing anything
  because _should_bind_port [8] returns False.  _bind_port_if_needed
  then proceeds to call _notify_port_updated [3], which does nothing due
  to the missing binding levels.
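
  Condensed into a runnable sketch, the two iterations look roughly like
  this (all names are simplified stand-ins for the referenced methods):

    class Ctx(object):
        def __init__(self, binding_levels):
            self.binding_levels = binding_levels

    state = {"bound": False}

    def attempt_binding(ctx):
        if state["bound"]:
            # Second iteration: _should_bind_port returns False, so
            # the (level-less) context is passed through unchanged.
            return False, ctx
        return True, Ctx(["level-0"])

    def commit_port_binding(ctx):
        # A concurrent caller won: return a *fresh* PortContext built
        # without binding levels (the bug), and force a retry.
        state["bound"] = True
        return Ctx([])

    def notify_port_updated(ctx):
        if not ctx.binding_levels:
            print("notification suppressed (no binding levels)")

    ctx = Ctx([])
    for attempt in (1, 2):    # "Attempt 2 to bind port ..." logged here
        did_bind, ctx = attempt_binding(ctx)
        if did_bind:
            ctx = commit_port_binding(ctx)

    notify_port_updated(ctx)  # prints: notification suppressed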

  This was discovered by our product test group using a simple setup of
  2 compute nodes and a single VM that was being live-migrated between
  the two nodes.  The VM was configured with 3 ports.  Over ~1000 live
  migrations this happened between 5 and 10 times, and each time it
  caused loss of communication to the VM instance because the agents
  were not given the latest L2Population data: the port appeared DOWN in
  the database.  Manual intervention was required to set the port to
  admin_state_up=False and then back to True, triggering an RPC
  notification that updates the port status to UP.
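
  For reference, that manual workaround can be scripted; a sketch using
  openstacksdk (assuming clouds.yaml credentials are configured; the
  port ID is the one from the log above):

    import openstack

    conn = openstack.connect()
    port_id = "ea5e524e-e7d4-4fec-a491-11f80f1de4a7"

    # Toggling admin_state_up forces a port update RPC to the agent,
    # which brings the port status back to UP.
    conn.network.update_port(port_id, is_admin_state_up=False)
    conn.network.update_port(port_id, is_admin_state_up=True)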

  This was observed in stable/pike, but looking at the code in master I
  don't see that it would behave any differently.

  
  [1] plugins.ml2.plugin.Ml2Plugin#_bind_port_if_needed
  [2] plugins.ml2.plugin.Ml2Plugin#update_port
  [3] plugins.ml2.plugin.Ml2Plugin#_notify_port_updated
  [4] plugins.ml2.rpc.RpcCallbacks#get_device_details
  [5] plugins.ml2.plugin.Ml2Plugin#get_bound_port_context
  [6] plugins.ml2.plugin.Ml2Plugin#_attempt_binding
  [7] plugins.ml2.plugin.Ml2Plugin#_commit_port_binding
  [8] plugins.ml2.plugin.Ml2Plugin#_should_bind_port

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1755810/+subscriptions

