
[Bug 1755810] [NEW] concurrent calls to _bind_port_if_needed results in port stuck in DOWN status

 

Public bug reported:

There is a concurrency issue in _bind_port_if_needed [1] that leads to
a missing RPC notification, which in turn leaves a port stuck in DOWN
status following a live-migration.  If update_port [2] runs
concurrently with any other API or RPC operation that needs to call
_bind_port_if_needed, it is possible that the _bind_port_if_needed call
originating from update_port will fail to send an RPC notification:
the PortContext has no binding levels, which causes
_notify_port_updated [3] to suppress the notification.
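
As a rough illustration (simplified stand-in code, not the actual
Neutron source), the suppression boils down to the notification helper
bailing out when the PortContext it receives carries no binding
information:

    # Minimal sketch of the suppression path; names are hypothetical
    # stand-ins, not the real Neutron code.
    class FakePortContext(object):
        def __init__(self, port, binding_levels=None):
            self.current = port
            self.binding_levels = binding_levels or []

    def notify_port_updated(context, rpc_notify):
        # Stand-in for _notify_port_updated: with no binding levels there
        # is nothing to describe to the agent, so no port_update is sent.
        if not context.binding_levels:
            return False
        rpc_notify(context.current)
        return True

    notifications = []
    sent = notify_port_updated(FakePortContext({'id': 'port-1'}),
                               notifications.append)
    assert sent is False and notifications == []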

For example, if get_device_details [4] runs concurrently with
update_port, it calls get_bound_port_context [5], which in turn calls
_bind_port_if_needed(notify=False), while update_port calls
_bind_port_if_needed(notify=True).  If the call made by update_port
commits the port binding first there is no issue, but if the call made
by get_device_details finishes first, no RPC notification is sent to
the agent.  Without that notification the port remains stuck in the
DOWN status until another port update forces the agent to act.  If
this happens early in the live-migration the system may auto-correct
when another port_update occurs, but if it happens on the last port
update of the live-migration the port remains stuck.
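
The ordering dependence can be modelled with a toy sequence (assumed
names and structure, not the plugin code): only the caller that
actually commits the binding ends up with binding levels on its
context, and only that caller is able to notify.

    # Toy model of the ordering; 'shared_binding' stands in for the DB row.
    def bind_port_if_needed(shared_binding, notify):
        if shared_binding['committed']:
            # Another caller already committed the binding; this caller's
            # fresh context has no binding levels, so nothing is notified.
            return False
        shared_binding['committed'] = True
        # Only the committing caller keeps binding levels on its context,
        # so only it can emit the port_update notification.
        return notify

    # Ordering 1: update_port (notify=True) commits first -> notified.
    binding = {'committed': False}
    assert bind_port_if_needed(binding, notify=True) is True

    # Ordering 2: get_device_details (notify=False) commits first -> the
    # later update_port call sends nothing and the port stays DOWN.
    binding = {'committed': False}
    assert bind_port_if_needed(binding, notify=False) is False
    assert bind_port_if_needed(binding, notify=True) is False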

A port stuck in the DOWN status has negative effects on consumers of the
L2Population functionality because the L2Population mechanism driver
will not be triggered to publish that a port is UP on a given compute
node.

The issue coincides with the following log message:

   2018-03-14 11:16:00.429 19987 INFO neutron.plugins.ml2.plugin
   [req-ef7e51e2-ef99-48f9-bc6f-45684c0bbce4 b9004dd32a07409787d2cf58f30b5fb8
   2c45a0a106574a56bff11c3e83c331a6 - default default] Attempt 2 to bind
   port ea5e524e-e7d4-4fec-a491-11f80f1de4a7

On the first iteration through _bind_port_if_needed the context
returned by _attempt_binding [6] has the proper binding levels set on
the PortContext, but the subsequent call to _commit_port_binding [7]
replaces the PortContext with a new instance that has no binding
levels set.  That new PortContext is returned and used within
_bind_port_if_needed during the second iteration.  On that second
iteration the call to _attempt_binding returns without doing anything
because _should_bind_port [6] returns False, and _bind_port_if_needed
then proceeds to call _notify_port_updated [3], which does nothing due
to the missing binding levels.
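
Condensed into stand-in code (again, not the real Ml2Plugin
implementation), the two iterations look roughly like this; the
important detail is that the commit step hands back a brand new
PortContext without binding levels:

    class PortContext(object):
        def __init__(self, binding_levels=None):
            self.binding_levels = binding_levels or []

    def commit_port_binding(bound_context):
        # Mirrors the behaviour described above: the commit hands back a
        # new PortContext that does not carry binding levels forward.
        return PortContext(binding_levels=[])

    def notify_port_updated(context):
        # Stand-in for _notify_port_updated: no binding levels, no RPC.
        return bool(context.binding_levels)

    # Iteration 1: _attempt_binding yields a context with binding levels...
    context = PortContext(binding_levels=['bound-segment'])
    # ...but _commit_port_binding replaces it with a fresh, empty context.
    context = commit_port_binding(context)
    # Iteration 2: _should_bind_port returns False, _attempt_binding is a
    # no-op, and the notification is silently dropped.
    assert notify_port_updated(context) is False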

This was discovered by our product test group using a simple setup of
two compute nodes and a single VM that was being live-migrated between
them.  The VM was configured with three ports.  Over roughly 1000 live
migrations the problem occurred between 5 and 10 times, and each
occurrence caused loss of communication to the VM instance because the
agents were not given the latest L2Population data while the port
appeared DOWN in the database.  Manual intervention was required:
setting the port admin_state_up to False and then back to True
triggers an RPC notification to the agent, which updates the port
status to UP.
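
For reference, the same workaround can be scripted, for example with
the openstacksdk client (assuming a configured cloud entry named
'mycloud'; the recovery in our tests was done by hand through the API):

    import openstack

    conn = openstack.connect(cloud='mycloud')
    port_id = 'ea5e524e-e7d4-4fec-a491-11f80f1de4a7'  # the stuck port

    # Flapping admin_state_up forces a port_update RPC to the agent,
    # which re-syncs the port and lets its status return to UP.
    conn.network.update_port(port_id, is_admin_state_up=False)
    conn.network.update_port(port_id, is_admin_state_up=True)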

This was observed in stable/pike, but looking at the code in master I
don't see that it would behave any differently.


[1] plugins.ml2.plugin.Ml2Plugin#_bind_port_if_needed
[2] plugins.ml2.plugin.Ml2Plugin#update_port
[3] plugins.ml2.plugin.Ml2Plugin#_notify_port_updated
[4] plugins.ml2.rpc.RpcCallbacks#get_device_details
[5] plugins.ml2.plugin.Ml2Plugin#get_bound_port_context
[6] plugins.ml2.plugin.Ml2Plugin#_attempt_binding
[7] plugins.ml2.plugin.Ml2Plugin#_commit_port_binding

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1755810



