[Bug 1755810] [NEW] concurrent calls to _bind_port_if_needed results in port stuck in DOWN status
Public bug reported:
There is a concurrency issue in _bind_port_if_needed [1] that leads to a
missing RPC notification, which in turn results in a port stuck in DOWN
status following a live-migration. If update_port [2] runs concurrently
with any other API or RPC handler that needs to call
_bind_port_if_needed, it is possible that the _bind_port_if_needed call
originating from update_port will fail to send an RPC notification: the
PortContext ends up without any binding levels, which causes
_notify_port_updated [3] to suppress the notification.
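To illustrate the suppression, here is a standalone paraphrase of the guard
at the top of _notify_port_updated. It is approximate, not the verbatim
Neutron code; the point is only that the notification keys off the bound
segment, which is absent when the PortContext carries no binding levels.

# Illustrative paraphrase of the guard in Ml2Plugin._notify_port_updated;
# names and structure are approximate, not the actual Neutron source.
from types import SimpleNamespace

def notify_port_updated(port_context, notifier):
    # bottom_bound_segment is derived from the PortContext's binding
    # levels; with no binding levels it is None, so we return early and
    # the agent never receives a port_update RPC.
    segment = port_context.bottom_bound_segment
    if not segment:
        print("no bound segment for port %s, notification suppressed"
              % port_context.current['id'])
        return
    notifier.port_update(port_context.current, segment)

# A PortContext whose binding was committed by a concurrent caller and
# therefore carries no binding levels / bound segment:
ctx = SimpleNamespace(bottom_bound_segment=None, current={'id': 'port-1'})
notify_port_updated(ctx, notifier=None)   # prints the skip message, nothing sent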
For example, if get_device_details [4] runs concurrently with
update_port, it will call get_bound_port_context [5], which in turn
calls _bind_port_if_needed(notify=False), while update_port calls
_bind_port_if_needed(notify=True). If the call made by update_port
commits the port binding first, there is no issue; but if the call made
by get_device_details finishes first, no RPC notification is sent to the
agent. Without that notification, the port remains stuck in the DOWN
status until another port update forces the agent to act. If this
happens early in the live-migration, the system may auto-correct when
another port_update occurs, but if it happens on the last port update of
the live-migration the port stays stuck.
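To make the ordering concrete, here is a minimal stand-in for the two
entry points (plain Python, not Neutron code; the function bodies only
model which caller commits the binding first).

# Minimal stand-in for the two racing entry points; the bodies are
# illustrative and just record who commits the port binding first.
binding_committed_by = None   # shared state: which caller committed the binding

def rpc_get_device_details():
    # plugins.ml2.rpc.RpcCallbacks#get_device_details ->
    # get_bound_port_context() -> _bind_port_if_needed(notify=False)
    global binding_committed_by
    if binding_committed_by is None:
        binding_committed_by = 'get_device_details'   # commits, never notifies

def api_update_port():
    # plugins.ml2.plugin.Ml2Plugin#update_port ->
    # _bind_port_if_needed(notify=True)
    global binding_committed_by
    if binding_committed_by is None:
        binding_committed_by = 'update_port'
        print('port_update RPC sent to the agent')     # happy path
    else:
        # The binding is already committed by the RPC path, so the
        # refreshed PortContext has no binding levels and nothing is sent.
        print('notification suppressed, port stays DOWN')

rpc_get_device_details()   # finishes first ...
api_update_port()          # ... so update_port never notifies the agent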
A port stuck in the DOWN status has negative effects on consumers of the
L2Population functionality because the L2Population mechanism driver
will not be triggered to publish that a port is UP on a given compute
node.
The issue coincides with the occurrence of this log:
2018-03-14 11:16:00.429 19987 INFO neutron.plugins.ml2.plugin [req-
ef7e51e2-ef99-48f9-bc6f-45684c0bbce4 b9004dd32a07409787d2cf58f30b5fb8
2c45a0a106574a56bff11c3e83c331a6 - default default] Attempt 2 to bind
port ea5e524e-e7d4-4fec-a491-11f80f1de4a7
On the first iteration through _bind_port_if_needed, the context
returned by _attempt_binding [6] has the proper binding levels set on
the PortContext, but the subsequent call to _commit_port_binding [7]
replaces the PortContext with a new instance that does not have any
binding levels set. That new PortContext is returned and used within
_bind_port_if_needed during the second iteration. On that second
iteration the call to _attempt_binding returns without doing anything
because _should_bind_port returns False. _bind_port_if_needed then
proceeds to call _notify_port_updated [3], which does nothing due to the
missing binding_levels.
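A condensed, self-contained walk-through of those two iterations is
below. The method names mirror the references at the end of this report,
but the classes, bodies and signatures are stand-ins, not the real
Ml2Plugin code; only the control flow follows the description above.

# Stand-in objects that walk through the two iterations described above.
class FakePortContext(object):
    def __init__(self, binding_levels):
        self.binding_levels = binding_levels

class FakeMl2Plugin(object):
    def __init__(self):
        self._db_has_binding = False   # flips when the concurrent caller commits

    def _should_bind_port(self, ctx):
        return not self._db_has_binding

    def _attempt_binding(self, ctx):
        if not self._should_bind_port(ctx):
            return ctx, False                         # iteration 2: no-op
        return FakePortContext(['level-0']), True     # iteration 1: levels set

    def _commit_port_binding(self, ctx):
        # The concurrent caller commits first; our commit is rejected and
        # we get back a *fresh* PortContext with no binding levels on it.
        self._db_has_binding = True
        return FakePortContext([]), False             # (refreshed context, committed?)

    def _notify_port_updated(self, ctx):
        if not ctx.binding_levels:
            print('no binding levels -> port_update notification dropped')
            return
        print('port_update notification sent')

    def _bind_port_if_needed(self, ctx, notify=True):
        for attempt in (1, 2):
            if attempt > 1:
                print('Attempt %d to bind port' % attempt)   # matches the log above
            ctx, did_bind = self._attempt_binding(ctx)
            if did_bind:
                ctx, committed = self._commit_port_binding(ctx)
                if not committed:
                    continue                                  # retry
        if notify:
            self._notify_port_updated(ctx)

FakeMl2Plugin()._bind_port_if_needed(FakePortContext([]))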
This was discovered by our product test group using a simple setup of 2
compute nodes and a single VM, configured with 3 ports, that was being
live-migrated between the two nodes. Over ~1000 live migrations the
problem occurred between 5 and 10 times, and each time it caused loss of
communication to the VM instance: the agents were not given the latest
L2Population data because the port appeared DOWN in the database.
Manual intervention was required to set the port's admin_state_up to
False and then back to True, triggering an RPC notification to the agent
to update the port status to UP.
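For completeness, that manual workaround can be scripted. Below is a
sketch using openstacksdk; the cloud name 'mycloud' is a placeholder for
a clouds.yaml entry, and any client capable of toggling admin_state_up
would work just as well.

# Sketch of the manual workaround with openstacksdk (assumptions: a
# clouds.yaml entry named 'mycloud'; substitute the stuck port's UUID).
import openstack

conn = openstack.connect(cloud='mycloud')
port_id = 'ea5e524e-e7d4-4fec-a491-11f80f1de4a7'   # port ID from the log above

# Bouncing admin_state_up forces a port_update RPC to the agent, which
# reprocesses the port and brings its status back to UP.
conn.network.update_port(port_id, is_admin_state_up=False)
conn.network.update_port(port_id, is_admin_state_up=True)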
This was observed in stable/pike but looking at the code in master I
don't see that it would behave any differently.
[1] plugins.ml2.plugin.Ml2Plugin#_bind_port_if_needed
[2] plugins.ml2.plugin.Ml2Plugin#update_port
[3] plugins.ml2.plugin.Ml2Plugin#_notify_port_updated
[4] plugins.ml2.rpc.RpcCallbacks#get_device_details
[5] plugins.ml2.plugin.Ml2Plugin#get_bound_port_context
[6] plugins.ml2.plugin.Ml2Plugin#_attempt_binding
[7] plugins.ml2.plugin.Ml2Plugin#_commit_port_binding
** Affects: neutron
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1755810