yahoo-eng-team team mailing list archive
Message #01324
[Bug 1131284] Re: Folsom erroneously destroys paused VMs
** Changed in: nova (Ubuntu)
Status: Confirmed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1131284
Title:
Folsom erroneously destroys paused VMs
Status in OpenStack Compute (Nova):
Invalid
Status in “nova” package in Ubuntu:
Fix Released
Bug description:
Requesting to add upstream stable commit:
https://github.com/openstack/nova/commit/7ace55fcf9e1b7fea074f6c0331b6feafbbc4178
reviewed here:
https://review.openstack.org/#/c/20337/
and which addresses this upstream bug:
https://bugs.launchpad.net/nova/+bug/1097806
(updated description of bug follows)
Libvirt-managed qemu/KVM VMs can be paused outside of nova compute's
workflow through a variety of means.
* By issuing virsh suspend
* By issuing virsh qemu-monitor-command '{"execute" : "stop"}'
* By causing qemu to emit a STOP event, for example when attaching a GDB debugger and single-stepping
* By connecting through an additional qemu monitor and issuing any commands that may cause qemu to emit a STOP event.
Starting in Folsom (specifically
https://github.com/openstack/nova/commit/129b87e17d3333aeaa9e855a70dea51e6581ea63#L6R2502
i.e. commit 129b87e diff line 2502) nova compute will destroy a VM if
libvirt reports it as paused and this doesn't fit nova compute's
recorded state for the VM.
The original rationale is to destroy VMs that are paused by IO
errors or KVM emulation errors, which also cause qemu to emit
STOP events.
The problem is that this will also destroy VMs that are paused for
any of the valid reasons outlined above.
The problem is exacerbated by a Libvirt bug
(https://bugzilla.redhat.com/show_bug.cgi?id=892791) which latches the
state of a VM to paused even though the VM is running. The fix is
already committed upstream
(http://libvirt.org/git/?p=libvirt.git;a=commit;h=aedfcce33e4c2f266668a39fd655574fe34f1265),
and it has been integrated into Raring and triaged for backport into
Precise: https://bugs.launchpad.net/bugs/1097824.
Even with libvirt's bug fixed, there are still points in time at which
nova-compute will check a VM's state, find it paused for a valid
reason, and decide to erroneously destroy it.
The fix is to either remove this behavior, or to further query libvirt
for the paused reason, which will show conclusively whether the VM is
effectively crashed, or just paused.
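A minimal sketch of the second option (querying the paused reason) follows.
It is not the code from the upstream commit. The constants mirror the numeric
values of libvirt's virDomainState and virDomainPausedReason enums and are
redefined locally so the sketch runs without libvirt installed. The function
name and the choice of which reasons count as a fault are assumptions made
for illustration only.

```python
# Values mirror libvirt's virDomainState / virDomainPausedReason enums.
# They are redefined here so the sketch runs without the libvirt bindings.
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_PAUSED = 3

VIR_DOMAIN_PAUSED_UNKNOWN = 0
VIR_DOMAIN_PAUSED_USER = 1      # e.g. "virsh suspend" or a monitor "stop"
VIR_DOMAIN_PAUSED_IOERROR = 5   # paused because of a disk I/O error
VIR_DOMAIN_PAUSED_CRASHED = 10  # paused because the guest crashed

# Assumption: only these reasons mean the VM is effectively crashed.
# Which reasons belong here is a policy decision, not something the
# bug report specifies.
FAULT_PAUSE_REASONS = frozenset([
    VIR_DOMAIN_PAUSED_IOERROR,
    VIR_DOMAIN_PAUSED_CRASHED,
])


def paused_due_to_fault(state, reason):
    """Return True only if the domain is paused for a fault-like reason.

    `state` and `reason` are the two integers libvirt reports for a
    domain. A VM paused by an operator or debugger returns False, so a
    caller would leave it alone instead of destroying it.
    """
    return state == VIR_DOMAIN_PAUSED and reason in FAULT_PAUSE_REASONS


if __name__ == "__main__":
    # Paused by an operator: not a fault, should be left alone.
    print(paused_due_to_fault(VIR_DOMAIN_PAUSED, VIR_DOMAIN_PAUSED_USER))
    # Paused by an I/O error: treated as a fault under this policy.
    print(paused_due_to_fault(VIR_DOMAIN_PAUSED, VIR_DOMAIN_PAUSED_IOERROR))
```

With the real libvirt Python bindings, the pair would come from the domain
object, for example `state, reason = dom.state()`, rather than from
hard-coded values.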
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1131284/+subscriptions