yahoo-eng-team team mailing list archive
Message #84253
[Bug 1901707] [NEW] race condition on port binding vs instance being resumed for live-migrations
Public bug reported:
This is split out from the discussion in this bug:
https://bugs.launchpad.net/neutron/+bug/1815989
Comment https://bugs.launchpad.net/neutron/+bug/1815989/comments/52 there goes through the flow
in detail on a Train deployment using neutron 15.1.0 (controller) and 15.3.0 (compute) and nova 20.4.0.
There is a race condition: nova live migration waits for neutron to
send the network-vif-plugged event, but once nova receives that event
the live migration completes faster than the OVS L2 agent can bind
the port on the destination compute node.
This causes the RARP frames sent out to update the switches' ARP tables
to fail, leaving the instance completely inaccessible after a live
migration unless these RARP frames are sent again or egress traffic is
initiated from the instance.
See Sean's comments after that for the view from the Nova side. The
correct behavior should be that the port is ready for use when nova gets
the external event, but maybe that is not possible from the neutron
side; again, see the comments in the other bug.
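
To make the ordering problem concrete, here is a minimal toy model in
Python. It is not nova or neutron code. Apart from network-vif-plugged,
the event names ("port-wired", "instance-resumed", "instance-egress")
and the DestinationPort class are hypothetical, and the model assumes
that a frame sent before the port is wired is simply lost.

```python
from dataclasses import dataclass, field


@dataclass
class DestinationPort:
    """Hypothetical stand-in for the instance's port on the destination host."""

    wired: bool = False  # True once the simulated L2 agent has bound the port
    delivered: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def transmit(self, frame: str) -> None:
        # Assumption of this model: frames sent before wiring are lost.
        target = self.delivered if self.wired else self.dropped
        target.append(frame)


def replay(events: list) -> DestinationPort:
    """Apply an ordered list of events to a fresh destination port."""
    port = DestinationPort()
    for event in events:
        if event == "network-vif-plugged":
            continue  # nova stops waiting; nothing changes on the port itself
        elif event == "port-wired":
            port.wired = True
        elif event == "instance-resumed":
            port.transmit("RARP")  # announcement sent when the guest resumes
        elif event == "instance-egress":
            port.transmit("egress")  # later traffic initiated by the guest
        else:
            raise ValueError(f"unknown event: {event}")
    return port


# Order reported in this bug: the event arrives and the guest resumes
# before the agent has finished wiring the port.
racy = replay(["network-vif-plugged", "instance-resumed", "port-wired"])

# Order the report argues for: the port is ready when the event is sent.
expected = replay(["port-wired", "network-vif-plugged", "instance-resumed"])

# Racy order followed by egress traffic from the guest.
recovered = replay(
    ["network-vif-plugged", "instance-resumed", "port-wired", "instance-egress"]
)
```

In the racy ordering the RARP frame ends up in dropped. In the recovered
ordering the later egress frame is delivered, which matches the
workaround described above.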
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1901707
Title:
race condition on port binding vs instance being resumed for live-
migrations
Status in neutron:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1901707/+subscriptions