
yahoo-eng-team team mailing list archive

[Bug 1429220] [NEW] libvirt does not ensure live migration will eventually complete (or abort)


Public bug reported:

Currently the libvirt driver's approach to live migration is best
characterized as "launch & pray". It starts the live migration
operation and then just unconditionally waits for it to finish. It
never makes any attempt to tune its behaviour (for example, by
changing the maximum permitted downtime), nor does it look at the data
transfer statistics to check whether it is making any progress, nor
does it have any overall timeout.
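
For context, the statistics and knobs libvirt already exposes during a
migration look roughly like the sketch below. This is illustrative
only, not Nova code; 'dom' is assumed to be the libvirt.virDomain
being live-migrated.

    import libvirt  # libvirt-python bindings

    # virDomainGetJobInfo() returns a flat list:
    #   [type, timeElapsed, timeRemaining,
    #    dataTotal, dataProcessed, dataRemaining,
    #    memTotal,  memProcessed,  memRemaining,
    #    fileTotal, fileProcessed, fileRemaining]

    def bytes_still_to_transfer(dom):
        return dom.jobInfo()[5]

    def relax_max_downtime(dom, downtime_ms):
        # Raise the maximum guest pause libvirt may impose (in
        # milliseconds), trading a longer blackout for a better chance
        # of the migration converging.
        dom.migrateSetMaxDowntime(downtime_ms, 0)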

It is not uncommon for guests to have workloads that will preclude
live migration from ever completing: they can be dirtying guest RAM
(or block devices) faster than the network is able to transfer the
changes to the destination host. In such a case Nova will just leave
the migration running, burning host CPU cycles and wasting network
bandwidth until the end of the universe.
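
A toy calculation (numbers purely illustrative) shows why a migration
like that can never converge:

    # Toy model: each pre-copy pass copies the current backlog while
    # the guest keeps dirtying RAM. If the dirty rate exceeds the
    # transfer rate, the backlog grows without bound.
    transfer_rate = 1.25e9   # bytes/sec usable on a 10 Gbit/s link
    dirty_rate = 2.0e9       # bytes/sec written by the guest workload
    remaining = 8e9          # bytes of guest RAM still to copy
    for iteration in range(5):
        seconds = remaining / transfer_rate   # time to copy the backlog
        remaining = dirty_rate * seconds      # RAM dirtied meanwhile
        print("pass %d: %.1f GB left" % (iteration, remaining / 1e9))
    # The backlog grows on every pass, so the migration never finishes.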

There are many features exposed by libvirt that Nova could be using to
do a better job, but the obvious question is which features, and how
they should be used. Fortunately Nova is not the first project to come
across this problem: the oVirt data center management project has
faced exactly the same issue. So rather than trying to invent new
logic for Nova, we should, as an immediate bug fix task, just copy the
oVirt logic from VDSM:

https://github.com/oVirt/vdsm/blob/master/vdsm/virt/migration.py#L430
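
Roughly, the shape of that logic (a hand-written paraphrase, not the
actual VDSM code; the interval, stall limit and downtime step values
below are made up) is a monitor that samples the remaining data,
counts consecutive samples with no progress, optionally relaxes the
allowed downtime, and eventually aborts the job:

    import time
    import libvirt

    def monitor_migration(dom, interval=10, max_stalls=6,
                          downtime_steps_ms=(100, 200, 400, 800)):
        """Abort a live migration that stops making progress (sketch)."""
        best_remaining = None
        stalls = 0
        steps = list(downtime_steps_ms)
        while True:
            info = dom.jobInfo()
            if info[0] != libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
                break                # job completed, failed or was aborted
            data_remaining = info[5]
            if best_remaining is None or data_remaining < best_remaining:
                best_remaining = data_remaining
                stalls = 0           # progress was made in this interval
            else:
                stalls += 1
                if steps:
                    # Permit a progressively longer guest pause (ms) so
                    # the final switch-over has a chance to happen.
                    dom.migrateSetMaxDowntime(steps.pop(0), 0)
            if stalls >= max_stalls:
                dom.abortJob()       # give up; the guest stays on the source
                break
            time.sleep(interval)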

If we get this out to users and then get real-world feedback on how it
operates, we will have a better idea of how and where to focus future
efforts.

** Affects: nova
     Importance: High
     Assignee: Daniel Berrange (berrange)
         Status: In Progress

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
     Assignee: (unassigned) => Daniel Berrange (berrange)

** Changed in: nova
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1429220

Title:
  libvirt does not ensure live migration will eventually complete (or abort)

Status in OpenStack Compute (Nova):
  In Progress

Bug description:
  Currently the libvirt driver's approach to live migration is best
  characterized as "launch & pray". It starts the live migration
  operation and then just unconditionally waits for it to finish. It
  never makes any attempt to tune its behaviour (for example, by
  changing the maximum permitted downtime), nor does it look at the
  data transfer statistics to check whether it is making any progress,
  nor does it have any overall timeout.

  It is not uncommon for guests to have workloads that will preclude
  live migration from ever completing: they can be dirtying guest RAM
  (or block devices) faster than the network is able to transfer the
  changes to the destination host. In such a case Nova will just leave
  the migration running, burning host CPU cycles and wasting network
  bandwidth until the end of the universe.

  There are many features exposed by libvirt that Nova could be using
  to do a better job, but the obvious question is which features, and
  how they should be used. Fortunately Nova is not the first project to
  come across this problem: the oVirt data center management project
  has faced exactly the same issue. So rather than trying to invent new
  logic for Nova, we should, as an immediate bug fix task, just copy
  the oVirt logic from VDSM:

  https://github.com/oVirt/vdsm/blob/master/vdsm/virt/migration.py#L430

  If we get this out to users and then get real-world feedback on how
  it operates, we will have a better idea of how and where to focus
  future efforts.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1429220/+subscriptions

