yahoo-eng-team team mailing list archive
Message #29116
[Bug 1429220] [NEW] libvirt does not ensure live migration will eventually complete (or abort)
Public bug reported:
Currently the libvirt driver's approach to live migration is best
characterized as "launch & pray". It starts the live migration
operation and then just unconditionally waits for it to finish. It never
makes any attempt to tune its behaviour (for example changing max
downtime), nor does it look at the data transfer statistics to check if
it is making any progress, nor does it have any overall timeout.
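As a rough illustration of the kind of monitoring that is missing, here
is a minimal sketch of a polling loop that enforces both a progress
timeout and an overall timeout. The `job` object, its `data_remaining()`
and `abort()` methods, and the timeout values are all hypothetical
placeholders, not Nova or VDSM code; in a real driver they would
presumably wrap libvirt's job statistics and job abort calls.

```python
import time


def monitor_migration(job, completion_timeout=800, progress_timeout=150,
                      poll_interval=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll a running migration; abort it if it stalls or runs too long.

    `job` is a hypothetical wrapper (an assumption of this sketch) with:
      job.data_remaining() -> bytes still to transfer, 0 once finished
      job.abort()          -> cancel the migration
    Timeouts are in seconds and are arbitrary example values.
    Returns "completed" or "aborted".
    """
    start = clock()
    last_progress = start
    low_water = None  # smallest data_remaining value seen so far

    while True:
        remaining = job.data_remaining()
        now = clock()

        if remaining == 0:
            return "completed"

        # Only a new low counts as progress; a guest that dirties memory
        # as fast as it is copied never sets a new low.
        if low_water is None or remaining < low_water:
            low_water = remaining
            last_progress = now

        too_long = now - start > completion_timeout
        stalled = now - last_progress > progress_timeout
        if too_long or stalled:
            job.abort()
            return "aborted"

        sleep(poll_interval)
```

Tracking the low-water mark rather than the latest sample is a
deliberate choice here: the amount remaining can bounce up and down
while the guest keeps dirtying pages, and only a new minimum shows the
migration is actually getting closer to finishing.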
It is not uncommon for guests to have workloads that will preclude live
migration from completing. Basically they can be dirtying guest RAM (or
block devices) faster than the network is able to transfer it to the
destination host. In such a case Nova will just leave the migration
running, burning up host CPU cycles and trashing network bandwidth until
the end of the universe.
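A back-of-the-envelope model makes the failure mode concrete. Assuming
constant rates (a simplification; real guests are bursty), the data left
to copy only shrinks at the transfer bandwidth minus the dirty rate. The
function name and parameters below are illustrative only.

```python
def estimate_transfer_seconds(remaining_bytes, bandwidth_bps, dirty_rate_bps):
    """Estimate seconds needed to drain `remaining_bytes`.

    Both rates are in bytes per second and assumed constant.
    Returns None when the guest dirties data at least as fast as the
    network can move it, i.e. the migration would never converge.
    """
    net_rate = bandwidth_bps - dirty_rate_bps
    if net_rate <= 0:
        return None
    return remaining_bytes / net_rate
```

For example, with 1000 bytes remaining, a 100 byte/s link and a guest
dirtying 50 bytes/s, the estimate is 20 seconds; raise the dirty rate to
100 bytes/s or more and there is no finite answer.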
There are many features exposed by libvirt that Nova could be using to
do a better job, but the question is obviously which features, and how
should they be used. Fortunately Nova is not the first project to come
across this problem. The oVirt data center management project has the
exact same problem. So rather than trying to invent some new logic for
Nova, we should, as an immediate bug fix task, just copy the oVirt logic
from VDSM:
https://github.com/oVirt/vdsm/blob/master/vdsm/virt/migration.py#L430
If we get this out to users and then get real world feedback on how it
operates, we will have an idea of how/where to focus future ongoing
efforts.
** Affects: nova
Importance: High
Assignee: Daniel Berrange (berrange)
Status: In Progress
** Changed in: nova
Importance: Undecided => High
** Changed in: nova
Assignee: (unassigned) => Daniel Berrange (berrange)
** Changed in: nova
Status: New => Confirmed
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1429220
Title:
libvirt does not ensure live migration will eventually complete (or abort)
Status in OpenStack Compute (Nova):
In Progress