Message #27182
[Bug 1414065] [NEW] Nova can lose track of running VM if live migration raises an exception
Public bug reported:
There is a fairly serious bug in VM state handling during live
migration, with the result that if libvirt raises an error *after* the VM
has successfully live migrated to the target host, Nova can end up
thinking the VM is shutoff everywhere, despite it still being active.
The consequences of this are quite dire, as the user can then manually
start the VM again and corrupt any data in shared volumes and the like.
The fun starts in the _live_migration method in
nova.virt.libvirt.driver, if the 'migrateToURI2' method fails *after*
the guest has completed migration.
At the start of migration, we see an event received by Nova for the new
QEMU process starting on the target host:
2015-01-23 15:39:57.743 DEBUG nova.compute.manager [-] [instance:
12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state
after lifecycle event "Started"; current vm_state: active, current
task_state: migrating, current DB power_state: 1, VM power_state: 1 from
(pid=19494) handle_lifecycle_event
/home/berrange/src/cloud/nova/nova/compute/manager.py:1134
Upon migration completion, we see the CPUs start running on the target host:
2015-01-23 15:40:02.794 DEBUG nova.compute.manager [-] [instance:
12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state
after lifecycle event "Resumed"; current vm_state: active, current
task_state: migrating, current DB power_state: 1, VM power_state: 1 from
(pid=19494) handle_lifecycle_event
/home/berrange/src/cloud/nova/nova/compute/manager.py:1134
And finally, an event saying that the QEMU process on the source host has stopped:
2015-01-23 15:40:03.629 DEBUG nova.compute.manager [-] [instance:
12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state
after lifecycle event "Stopped"; current vm_state: active, current
task_state: migrating, current DB power_state: 1, VM power_state: 4 from
(pid=23081) handle_lifecycle_event
/home/berrange/src/cloud/nova/nova/compute/manager.py:1134
It is the last event that causes the trouble: it makes Nova mark the VM
as shutoff at this point.
Normally the '_live_migration' method would succeed and so Nova would then
immediately & explicitly mark the guest as running on the target host.
If an exception occurs though, this explicit update of VM state doesn't
happen, so Nova considers the guest shutoff, even though it is still
running :-(
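To make the sequence concrete, here is a toy model of the failure mode. It is not Nova code: ToyInstance, handle_lifecycle_event, toy_live_migration and failing_migrate are hypothetical stand-ins, and only the power_state values 1 and 4 are taken from the logs above.

```python
RUNNING = 1   # matches "power_state: 1" in the logs above
SHUTDOWN = 4  # matches "power_state: 4" in the logs above


class ToyInstance:
    """Hypothetical stand-in for the DB record Nova keeps for a guest."""

    def __init__(self):
        self.host = "source"
        self.power_state = RUNNING


def handle_lifecycle_event(instance, vm_power_state):
    # Toy version of the sync step: trust whatever the event reports.
    instance.power_state = vm_power_state


def toy_live_migration(instance, migrate):
    # The migrate call may raise even though the guest already moved.
    migrate(instance)
    # Only reached on success: the explicit "running on target" update.
    instance.host = "target"
    instance.power_state = RUNNING


def failing_migrate(instance):
    # The source QEMU stops, so a "Stopped" event is processed first...
    handle_lifecycle_event(instance, SHUTDOWN)
    # ...and then the migration call itself reports an error.
    raise RuntimeError("error raised after the guest completed migration")


inst = ToyInstance()
try:
    toy_live_migration(inst, failing_migrate)
except RuntimeError:
    pass

# The record now claims the guest is shut down on the source host,
# although in the real scenario it is running on the target.
print(inst.host, inst.power_state)
```

In this model the explicit update after migrate() is the only thing that repairs the record, which is why an exception at that point leaves it wrong.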
The lifecycle events from libvirt have an associated "reason", so we
could detect that the shutoff event from libvirt corresponds to a
migration being completed, and so not mark the VM as shutoff in Nova. We
would also have to make sure the target host processes the 'resume'
event upon migration completion.
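A minimal sketch of that check, assuming the event type and its "reason" (libvirt calls it the detail) are both available to the handler. The constants are copied from libvirt's virDomainEventType and detail enums so the snippet runs without libvirt installed; real code would take them from the libvirt Python bindings. The two helper functions are hypothetical, not existing Nova functions.

```python
# Copied from libvirt's enums; normally available as libvirt.VIR_DOMAIN_EVENT_*.
VIR_DOMAIN_EVENT_RESUMED = 4
VIR_DOMAIN_EVENT_STOPPED = 5
VIR_DOMAIN_EVENT_STOPPED_MIGRATED = 3  # detail: stopped because it migrated away
VIR_DOMAIN_EVENT_RESUMED_MIGRATED = 1  # detail: resumed on arrival of a migration


def should_mark_shutoff(event, detail):
    """Return True only for a stop that is not the tail end of a migration."""
    if event != VIR_DOMAIN_EVENT_STOPPED:
        return False
    return detail != VIR_DOMAIN_EVENT_STOPPED_MIGRATED


def is_migration_arrival(event, detail):
    """Return True for the resume event the target host must not ignore."""
    return (event == VIR_DOMAIN_EVENT_RESUMED
            and detail == VIR_DOMAIN_EVENT_RESUMED_MIGRATED)
```

With a filter like this the source host's "Stopped" event would no longer flip the guest to shutoff on its own, but the target host then has to act on the arrival event, otherwise nothing records where the guest ended up.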
A safer approach, though, might be to just mark the VM as in an ERROR
state if any exception occurs during migration.
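Roughly, that approach could look like the sketch below. Again this is only an illustration under assumptions: GuestRecord and migrate_or_mark_error are hypothetical names, and the real change would live in Nova's own migration and state handling code.

```python
ACTIVE = "active"
ERROR = "error"


class GuestRecord:
    """Hypothetical stand-in for the instance record, mid-migration."""

    def __init__(self):
        self.vm_state = ACTIVE
        self.task_state = "migrating"


def migrate_or_mark_error(record, migrate):
    try:
        migrate(record)
    except Exception:
        # We cannot tell where the guest is running, so refuse to guess:
        # flag the record for an operator and re-raise the original error.
        record.vm_state = ERROR
        record.task_state = None
        raise
    # Success path: the guest is known to be running on the target.
    record.vm_state = ACTIVE
    record.task_state = None


def broken_migrate(record):
    raise RuntimeError("error raised by the migration call")


rec = GuestRecord()
try:
    migrate_or_mark_error(rec, broken_migrate)
except RuntimeError:
    pass

print(rec.vm_state)
```

The point of the ERROR state here is that it does not pretend the guest is stopped, so the user is not invited to start a second copy against the same shared volumes.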
** Affects: nova
Importance: High
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1414065