yahoo-eng-team team mailing list archive

[Bug 1631647] Re: Network downtime during live migration through routers

 

Bug closed due to lack of activity, please feel free to reopen if
needed.

** Changed in: neutron
       Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1631647

Title:
  Network downtime during live migration through routers

Status in neutron:
  Won't Fix

Bug description:
  neutron/master (close to stable/newton)
  VXLAN networks with simple network node (not DVR)

  There is network downtime of several seconds during a live migration.
  How long it lasts depends on the gap between the VM resuming on the
  target host and the migration ‘completing’.

  When a live migration occurs, there is a point in its life cycle where
  the VM pauses on the source and starts up (or resumes) on the target.
  At that point the migration isn’t complete; the system has only
  determined that the VM is now best run on the target.  The details
  vary per hypervisor, but this is the general flow for most of them.

  So during the migration the port goes through a few states.
  1) Pre migration, it’s tied solely to the source host.
  2) During migration, it’s still tied to the source host, but the port
     profile has a ‘migrating_to’ attribute that identifies the target host.
  3) Post migration, the port is tied solely to the target host.
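
  The three port states above can be told apart from the port's binding
  fields.  A minimal Python sketch, assuming the port is available as a
  dict carrying the binding:host_id and binding:profile attributes;
  migration_phase and the host names are hypothetical and only for
  illustration, not part of neutron or nova:

```python
def migration_phase(port, source_host, target_host):
    """Classify a port dict into one of the three states listed above.

    Hypothetical helper: `port` is assumed to look like a Neutron port
    body with 'binding:host_id' and 'binding:profile' keys.
    """
    host = port.get("binding:host_id")
    # binding:profile may be missing or None; treat both as empty.
    profile = port.get("binding:profile") or {}
    migrating_to = profile.get("migrating_to")

    if host == target_host:
        # State 3: the binding has moved to the target host.
        return "post-migration"
    if host == source_host and migrating_to == target_host:
        # State 2: still bound to the source, target is only advertised.
        return "migrating"
    if host == source_host and migrating_to is None:
        # State 1: no migration in progress.
        return "pre-migration"
    return "unknown"


if __name__ == "__main__":
    port = {"binding:host_id": "src",
            "binding:profile": {"migrating_to": "dst"}}
    print(migration_phase(port, "src", "dst"))
```

  Anything that keys its forwarding decisions on binding:host_id alone
  would still see the source host throughout state 2 in this model.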

  
  The OVS agent handles the migration well.  It detects the port, sees
  the UUID, and treats the port properly.  But components such as the
  router don’t seem to handle it properly, at least in my testing.

  It seems the routing information in the router is updated only once
  the VM hits step 3 (post migration, where nova updates the port to be
  solely on the target host).

  In fact, it’s kind of interesting.  I’ve been running a constant ping
  during the live migration through the router and watching it on both
  sides with tcpdump.  When the VM resumes on the target, but before the
  live migration has completed, the following happens:
   - Ping request goes out from the target server
   - Goes out through the router
   - Comes back into the router
   - Gets sent to the source server
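
  One way to put a number on the outage seen in such a ping run is to
  measure the longest stretch between two successful replies that has
  losses in between.  A rough Python sketch, assuming the ping results
  have already been parsed into (timestamp, reply_received) pairs;
  longest_outage and the sample data are hypothetical, not output from
  the test described here:

```python
def longest_outage(samples):
    """Return the longest gap, in seconds, between two successful replies
    that are separated by at least one lost probe.

    `samples` is an iterable of (timestamp_seconds, reply_received)
    tuples in chronological order.  Returns 0.0 if nothing was lost
    between two replies.
    """
    longest = 0.0
    last_ok = None          # timestamp of the most recent reply
    lost_since_ok = False   # True once a probe is lost after last_ok
    for timestamp, reply_received in samples:
        if reply_received:
            if last_ok is not None and lost_since_ok:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            lost_since_ok = False
        else:
            lost_since_ok = True
    return longest


if __name__ == "__main__":
    # Made-up one-second ping cadence with three lost probes.
    samples = [(0.0, True), (1.0, True), (2.0, False), (3.0, False),
               (4.0, False), (5.0, True), (6.0, True)]
    print(longest_outage(samples))
```

  The gap is measured from the last good reply to the next good reply,
  so it is an upper bound on the real outage at the probe interval used.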

  I’m not sure if this is somehow specific to VXLAN.  I haven’t had a
  chance to try Geneve yet.

  This could impact projects like Watcher, which will use live migration
  to constantly optimize the system.  Such optimization could be
  undesirable because it would introduce downtime on the workloads
  being moved around.

  If the time between VM resume and live migration completion is
  minimal, then the impact can be quite small (a couple of seconds).  If
  KVM uses post-copy, it should be susceptible to this problem.
  http://wiki.qemu.org/Features/PostCopyLiveMigration

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1631647/+subscriptions


