yahoo-eng-team team mailing list archive

[Bug 1131284] Re: Folsom erroneously destroys paused VMs

 

** Changed in: nova (Ubuntu)
       Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1131284

Title:
  Folsom erroneously destroys paused VMs

Status in OpenStack Compute (Nova):
  Invalid
Status in “nova” package in Ubuntu:
  Fix Released

Bug description:
  Requesting to add upstream stable commit:
  https://github.com/openstack/nova/commit/7ace55fcf9e1b7fea074f6c0331b6feafbbc4178

  reviewed here:
  https://review.openstack.org/#/c/20337/

  and which addresses this upstream bug:
  https://bugs.launchpad.net/nova/+bug/1097806

  (updated description of bug follows)

  Libvirt-managed qemu/KVM VMs can be paused outside of nova compute's
  workflow through a variety of means.

  * By issuing virsh suspend
  * By issuing virsh qemu-monitor-command '{"execute" : "stop"}'
  * By causing qemu to emit a STOP event, for example when attaching a GDB debugger and single-stepping
  * By connecting through an additional qemu monitor and issuing any commands that may cause qemu to emit a STOP event.

  Starting in Folsom (specifically
  https://github.com/openstack/nova/commit/129b87e17d3333aeaa9e855a70dea51e6581ea63#L6R2502
  i.e. commit 129b87e diff line 2502) nova compute will destroy a VM if
  libvirt reports it as paused and this doesn't fit nova compute's
  recorded state for the VM.

  The original rationale was to destroy VMs that are paused by IO
  errors or KVM emulation errors, which also cause qemu to emit STOP
  events.

  The problem is that this also destroys VMs that were paused for any
  of the valid reasons outlined above.

  The problem is exacerbated by a libvirt bug
  (https://bugzilla.redhat.com/show_bug.cgi?id=892791) which latches the
  state of a VM to paused even though the VM is running. The fix is
  already committed upstream
  (http://libvirt.org/git/?p=libvirt.git;a=commit;h=aedfcce33e4c2f266668a39fd655574fe34f1265),
  as well as being integrated into Raring and triaged for backport into
  Precise: https://bugs.launchpad.net/bugs/1097824.

  Even with libvirt's bug fixed, there are still points in time at which
  nova-compute will check a VM's state, find it paused for a valid
  reason, and erroneously decide to destroy it.

  The fix is to either remove this behavior, or to further query libvirt
  for the paused reason, which will show conclusively whether the VM is
  effectively crashed, or just paused.
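
  As an illustration of the second option, here is a minimal sketch of
  how the paused reason could be checked. It is not the code from the
  upstream commit. It assumes the libvirt Python binding, where
  virDomain.state() returns a [state, reason] pair; the constants are
  copied from libvirt's virDomainState and virDomainPausedReason enums
  so the sketch runs without libvirt installed. The helper names and
  the choice of which reasons count as a fault are assumptions made
  for this example only.

```python
# Values mirror libvirt's virDomainState / virDomainPausedReason enums.
# In real code, use libvirt.VIR_DOMAIN_PAUSED etc. instead of literals.
VIR_DOMAIN_PAUSED = 3            # domain is paused
VIR_DOMAIN_PAUSED_USER = 1       # paused on request (e.g. virsh suspend)
VIR_DOMAIN_PAUSED_IOERROR = 5    # paused by qemu after an IO error

# Assumption: only an IO error is treated as a fault here. A real
# policy may want to consider other reasons as well.
FAULT_REASONS = frozenset({VIR_DOMAIN_PAUSED_IOERROR})


def paused_by_fault(state, reason):
    """Return True only if the domain is paused because of a fault."""
    return state == VIR_DOMAIN_PAUSED and reason in FAULT_REASONS


def should_destroy(domain):
    """Hypothetical helper: decide from libvirt's view of one domain.

    `domain` is expected to behave like a libvirt.virDomain, i.e. to
    offer state() returning [state, reason].
    """
    state, reason = domain.state()
    return paused_by_fault(state, reason)
```

  With this check, a VM paused through virsh suspend reports
  VIR_DOMAIN_PAUSED_USER and is left alone, while a VM stopped by an IO
  error is still flagged.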

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1131284/+subscriptions