yahoo-eng-team team mailing list archive
Message #02129
[Bug 1097806] Re: VMs paused unbeknownst to nova compute are destroyed
** Changed in: nova/folsom
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1097806
Title:
VMs paused unbeknownst to nova compute are destroyed
Status in OpenStack Compute (Nova):
Fix Released
Status in OpenStack Compute (nova) folsom series:
Fix Released
Bug description:
Libvirt-managed qemu/KVM VMs can be paused outside of nova compute's
workflow in several ways:
* By issuing virsh suspend
* By issuing virsh qemu-monitor-command '{"execute" : "stop"}'
* By causing qemu to emit a STOP event, for example when attaching a GDB debugger and single-stepping
* By connecting through an additional qemu monitor and issuing any command that causes qemu to emit a STOP event
Starting in Folsom (specifically commit 129b87e, diff line 2502:
https://github.com/openstack/nova/commit/129b87e17d3333aeaa9e855a70dea51e6581ea63#L6R2502
), nova compute destroys a VM if libvirt reports it as paused and that
state does not match nova compute's recorded state for the VM.
I surmise the original rationale is to destroy VMs that are paused by
IO errors or KVM emulation errors, which would also cause qemu to emit
STOP events.
The problem is that this also destroys VMs that are paused for any of
the valid reasons outlined above.
The problem is exacerbated by a libvirt bug
(https://bugzilla.redhat.com/show_bug.cgi?id=892791) which latches the
state of a VM to paused even though the VM is running. The fix is
already committed upstream
(http://libvirt.org/git/?p=libvirt.git;a=commit;h=aedfcce33e4c2f266668a39fd655574fe34f1265)
and we intend for it to be backported into distros.
Even with libvirt's bug fixed, there are still points in time at which
nova-compute will check a VM's state, find it paused for a valid
reason, and erroneously decide to destroy it.
The fix is either to remove this behavior or to query libvirt further
for the paused reason, which shows conclusively whether the VM has
effectively crashed or is merely paused.
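
The second option can be sketched roughly as follows. This is only an
illustration, not nova's actual code: the function name
paused_vm_is_crashed and the FATAL_PAUSE_REASONS set are hypothetical,
and which pause reasons count as "effectively crashed" is an assumed
policy. The numeric constants are meant to mirror libvirt's
virDomainState and virDomainPausedReason enums; real code should take
them from the libvirt Python module rather than hard-coding them.

```python
# Values mirroring libvirt's virDomainState / virDomainPausedReason enums.
# Hard-coded here only so the sketch runs without libvirt installed.
VIR_DOMAIN_PAUSED = 3            # virDomainState: domain is paused

VIR_DOMAIN_PAUSED_UNKNOWN = 0    # reason not known
VIR_DOMAIN_PAUSED_USER = 1       # paused on request, e.g. virsh suspend
VIR_DOMAIN_PAUSED_IOERROR = 5    # paused by qemu after a disk I/O error
VIR_DOMAIN_PAUSED_WATCHDOG = 6   # paused by a watchdog event

# Assumption: only these reasons mean the VM is effectively crashed.
FATAL_PAUSE_REASONS = frozenset({
    VIR_DOMAIN_PAUSED_IOERROR,
    VIR_DOMAIN_PAUSED_WATCHDOG,
})


def paused_vm_is_crashed(state, reason):
    """Return True only if the domain is paused for a fatal reason.

    `state` and `reason` are the two integers reported by libvirt for a
    domain (the state and the reason for being in that state).
    """
    if state != VIR_DOMAIN_PAUSED:
        # Not paused at all, so the paused-VM cleanup does not apply.
        return False
    return reason in FATAL_PAUSE_REASONS
```

With the libvirt Python bindings, the pair would come from something
like "state, reason = dom.state()" on a domain object. A VM paused by
virsh suspend reports the USER reason and would be left alone, while
one stopped by an I/O error would still be eligible for cleanup.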
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1097806/+subscriptions