yahoo-eng-team team mailing list archive

[Bug 1901707] Re: race condition on port binding vs instance being resumed for live-migrations

 

Adding nova, as there is a nova element that needs to be fixed as well.

Because nova was observing the network-vif-plugged event from the dhcp
agent, we were not filtering our wait condition on live migrate to only
wait for backends that send plug-time events.

So once this is fixed by Rodolfo's patch, it actually breaks live
migration, because we are waiting for an event that will never come until
https://review.opendev.org/c/openstack/nova/+/602432 is merged.

For backporting reasons I am working on a separate trivial patch to only
wait for backends that send plug-time events. That patch will be
backported first, allowing Rodolfo's patch to be backported before
https://review.opendev.org/c/openstack/nova/+/602432
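
As a rough illustration of the filtering idea (not the actual nova code),
the sketch below builds the list of events to wait for from only those
VIFs whose backend is known to send a plug-time event. The Vif class, the
sends_plug_time_event flag and events_to_wait_for() are hypothetical names
invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Vif:
    """Hypothetical stand-in for a VIF; not nova's real network model."""
    port_id: str
    # Assumption: we know per VIF whether its backend emits
    # network-vif-plugged when the VIF is plugged on the destination.
    sends_plug_time_event: bool


def events_to_wait_for(vifs: List[Vif]) -> List[Tuple[str, str]]:
    """Return (event name, port id) pairs for plug-time backends only.

    VIFs whose backend never sends a plug-time event are skipped, so the
    caller does not wait for an event that will never arrive.
    """
    return [
        ("network-vif-plugged", vif.port_id)
        for vif in vifs
        if vif.sends_plug_time_event
    ]


if __name__ == "__main__":
    vifs = [Vif("port-a", True), Vif("port-b", False)]
    print(events_to_wait_for(vifs))
```

With this kind of filter, a migration whose VIFs all lack plug-time events
waits for nothing at this step instead of blocking until a timeout.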

I have 1 unit test left to update in the plug-time patch and then I'll
push it and reference this bug.

** Also affects: nova
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: New => Triaged

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
     Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1901707

Title:
  race condition on port binding vs instance being resumed for live-
  migrations

Status in neutron:
  In Progress
Status in OpenStack Compute (nova):
  Triaged

Bug description:
  This is split off from the discussion in this bug
  https://bugs.launchpad.net/neutron/+bug/1815989

  The comment https://bugs.launchpad.net/neutron/+bug/1815989/comments/52 goes through the flow in
  detail on a Train deployment using neutron 15.1.0 (controller) and 15.3.0 (compute) and nova 20.4.0

  There is a race condition where nova live migration waits for neutron
  to send the network-vif-plugged event, but once nova receives that
  event the live migration completes faster than the OVS L2 agent can
  bind the port on the destination compute node.

  This causes the RARP frames sent out to update the switches' ARP
  tables to fail, leaving the instance completely inaccessible after a
  live migration unless these RARP frames are sent again or egress
  traffic is initiated from the instance.

  See Sean's comments after for the view from the Nova side. The correct
  behavior should be that the port is ready for use when nova gets the
  external event, but maybe that is not possible from the neutron side;
  again, see the comments in the other bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1901707/+subscriptions

