
yahoo-eng-team team mailing list archive

[Bug 1097806] Re: VMs paused unbeknownst to nova compute are destroyed

 

** Changed in: nova/folsom
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1097806

Title:
  VMs paused unbeknownst to nova compute are destroyed

Status in OpenStack Compute (Nova):
  Fix Released
Status in OpenStack Compute (nova) folsom series:
  Fix Released

Bug description:
  Libvirt-managed qemu/KVM VMs can be paused outside of nova compute's
  workflow in several ways:

  * By issuing virsh suspend
  * By issuing virsh qemu-monitor-command '{"execute" : "stop"}'
  * By causing qemu to emit a STOP event, for example when attaching a GDB debugger and single-stepping
  * By connecting through an additional qemu monitor and issuing any command that causes qemu to emit a STOP event

  Starting in Folsom (specifically commit 129b87e, diff line 2502:
  https://github.com/openstack/nova/commit/129b87e17d3333aeaa9e855a70dea51e6581ea63#L6R2502)
  nova compute destroys a VM if libvirt reports it as paused and that
  state does not match nova compute's recorded state for the VM.

  I surmise the original rationale is to destroy VMs that are paused by
  IO errors or KVM emulation errors, which would also cause qemu to emit
  STOP events.

  The problem is that this also destroys VMs that were paused for any of
  the valid reasons outlined above.

  The problem is exacerbated by a libvirt bug
  (https://bugzilla.redhat.com/show_bug.cgi?id=892791) which latches the
  state of a VM to paused even though the VM is running. The fix is
  already committed upstream
  (http://libvirt.org/git/?p=libvirt.git;a=commit;h=aedfcce33e4c2f266668a39fd655574fe34f1265)
  and we intend for it to reach distros through backports.

  Even with libvirt's bug fixed, there are still points in time at which
  nova-compute will check a VM's state, find it paused for a valid
  reason, and erroneously decide to destroy it.

  The fix is either to remove this behavior or to query libvirt further
  for the paused reason, which shows conclusively whether the VM has
  effectively crashed or is merely paused.
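
  The second option could look roughly like the sketch below. It is not
  the patch that was merged into nova. It assumes the libvirt Python
  binding's virDomain.state() call, which returns a [state, reason]
  pair, and it copies the numeric values of libvirt's virDomainState and
  virDomainPausedReason enums so that it runs without libvirt installed.
  The names paused_by_fault, FATAL_PAUSE_REASONS and FakeDomain are
  hypothetical, and the choice of which reasons count as a fault is an
  assumption.

```python
# Values mirror libvirt's virDomainState / virDomainPausedReason enums.
# They are copied here only so the sketch runs without libvirt installed;
# real code would use libvirt.VIR_DOMAIN_PAUSED and friends.
VIR_DOMAIN_PAUSED = 3

VIR_DOMAIN_PAUSED_IOERROR = 5    # paused because of a disk I/O error
VIR_DOMAIN_PAUSED_WATCHDOG = 6   # paused by a watchdog event
VIR_DOMAIN_PAUSED_CRASHED = 10   # guest crashed

# Assumption: treating these three reasons as "effectively crashed" is a
# policy choice made for illustration, not something the bug report fixes.
FATAL_PAUSE_REASONS = frozenset({
    VIR_DOMAIN_PAUSED_IOERROR,
    VIR_DOMAIN_PAUSED_WATCHDOG,
    VIR_DOMAIN_PAUSED_CRASHED,
})


def paused_by_fault(domain):
    """Return True only if the domain is paused for a fault-like reason.

    `domain` is expected to behave like libvirt.virDomain: its state()
    method returns a [state, reason] pair.
    """
    state, reason = domain.state()
    return state == VIR_DOMAIN_PAUSED and reason in FATAL_PAUSE_REASONS


class FakeDomain:
    """Hypothetical stand-in for libvirt.virDomain, used for illustration."""

    def __init__(self, state, reason):
        self._state = state
        self._reason = reason

    def state(self):
        return [self._state, self._reason]
```

  With a check like this, a VM paused by virsh suspend (paused reason
  USER, value 1) would be left alone, while a VM paused by an I/O error
  could still be handled as crashed.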

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1097806/+subscriptions