← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1414065] Re: Nova can lose track of running VM if live migration raises an exception

 

** Changed in: nova/juno
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1414065

Title:
  Nova can lose track of running VM if live migration raises an
  exception

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) juno series:
  Fix Released

Bug description:
  There is a fairly serious bug in VM state handling during live
  migration, with a result that if libvirt raises an error *after* the
  VM has successfully live migrated to the target host, Nova can end up
  thinking the VM is shutoff everywhere, despite it still being active.
  The consequences of this are quite dire as the user can then manually
  start the VM again and corrupt any data in shared volumes and the
  like.

  The fun starts in the _live_migration method in
  nova.virt.libvirt.driver, if the 'migrateToURI2' method fails *after*
  the guest has completed migration.

  At start of migration, we see an event received by Nova for the new
  QEMU process starting on target host

  2015-01-23 15:39:57.743 DEBUG nova.compute.manager [-] [instance:
  12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power
  state after lifecycle event "Started"; current vm_state: active,
  current task_state: migrating, current DB power_state: 1, VM
  power_state: 1 from (pid=19494) handle_lifecycle_event
  /home/berrange/src/cloud/nova/nova/compute/manager.py:1134

  
  Upon migration completion we see CPUs start running on the target host

  2015-01-23 15:40:02.794 DEBUG nova.compute.manager [-] [instance:
  12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power
  state after lifecycle event "Resumed"; current vm_state: active,
  current task_state: migrating, current DB power_state: 1, VM
  power_state: 1 from (pid=19494) handle_lifecycle_event
  /home/berrange/src/cloud/nova/nova/compute/manager.py:1134

  And finally an event saying that the QEMU on the source host has
  stopped

  2015-01-23 15:40:03.629 DEBUG nova.compute.manager [-] [instance:
  12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power
  state after lifecycle event "Stopped"; current vm_state: active,
  current task_state: migrating, current DB power_state: 1, VM
  power_state: 4 from (pid=23081) handle_lifecycle_event
  /home/berrange/src/cloud/nova/nova/compute/manager.py:1134

  
  It is the last event that causes the trouble.  It causes Nova to mark the VM as shutoff at this point.

  Normally the '_live_migrate' method would succeed and so Nova would
  then immediately & explicitly mark the guest as running on the target
  host.   If an exception occurrs though, this explicit update of VM
  state doesn't happen so Nova considers the guest shutoff, even though
  it is still running :-(

  
  The lifecycle events from libvirt have an associated "reason", so we could see that the shutoff event from libvirt corresponds to a migration being completed, and so not mark the VM as shutoff in Nova.  We would also have to make sure the target host processes the 'resume' event upon migrate completion.

  An safer approach though, might be to just mark the VM as in an ERROR
  state if any exception occurs during migration.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1414065/+subscriptions


References