yahoo-eng-team team mailing list archive
Message #57421
[Bug 1627134] Re: libvirt driver stuck deleting online snapshot
Reviewed: https://review.openstack.org/378746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0f4bd241665c287e49f2d30ca79be96298217b7e
Submitter: Jenkins
Branch: master
commit 0f4bd241665c287e49f2d30ca79be96298217b7e
Author: Matthew Booth <mbooth@xxxxxxxxxx>
Date: Wed Sep 28 16:44:41 2016 +0100
libvirt: Fix BlockDevice.wait_for_job when qemu reports no job
We were misinterpreting the return value of blockJobInfo. Most
immediately we were expecting it to return an integer, which has never
been the case. blockJobInfo also raises an exception on error. Note
that the implementation of abort_on_error has always expected an
integer return value, and exceptions have never been handled, which
means that abort_on_error has always been a no-op, and exceptions have
never been swallowed. As this is also the most intuitive behaviour, we
make it explicit by removing abort_on_error. Any exception raised by
blockJobInfo will continue to propagate unhandled.
We were obfuscating the return value indicating that the job did not
exist, {}, by populating a BlockDeviceJobInfo with fake values. We
de-obfuscate this by returning None instead, which is unambiguous.
wait_for_job() was misnamed, as it does not wait. This is renamed to
is_job_complete() to be less confusing. Note that the logic is
reversed.
After discussion with Eric Blake of the libvirt team (see review
comments: https://review.openstack.org/#/c/375652/), we are now
confident asserting that if no job exists then it has completed
(although we are still not sure that it succeeded). Consequently we
remove the wait_for_job_clean parameter, and always assume that no job
means it has completed. Previously this was implicit because no job
meant a defaulted BlockDeviceJobInfo.job value of 0.
Co-authored-by: Sławek Kapłoński <slawek@xxxxxxxxxxxx>
Closes-Bug: #1627134
Change-Id: I2d0daa32b1d37fa60412ad7a374ee38cebdeb579
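To make the corrected semantics concrete, here is a minimal sketch of the behaviour the commit describes, assuming a libvirt-python-style domain object whose blockJobInfo returns an empty dict when no job exists and raises on error. The helper names (get_job_info, is_job_complete) follow the commit text, but the bodies are illustrative, not nova's actual code:

```python
def get_job_info(domain, disk):
    """Return the raw job status for disk, or None if no job exists.

    blockJobInfo returns an empty dict when there is no job, and a dict
    with 'cur'/'end' progress counters while a job is running. Any
    libvirt error raises an exception, which we let propagate unhandled.
    """
    status = domain.blockJobInfo(disk, 0)
    if not status:
        # {} means no job exists: return None rather than padding a
        # fake status object with zeros.
        return None
    return status


def is_job_complete(domain, disk):
    """Treat 'no job reported' as 'the job has completed'."""
    status = get_job_info(domain, disk)
    if status is None:
        return True
    return status['cur'] == status['end']
```

Returning None for "no job" keeps that case unambiguous, instead of being indistinguishable from a real job whose counters happen to be zero.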
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1627134
Title:
libvirt driver stuck deleting online snapshot
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) newton series:
Confirmed
Bug description:
There is a problem in nova code in nova/virt/libvirt/driver.py:

    dev = guest.get_block_device(rebase_disk)
    if guest.is_active():
        result = dev.rebase(rebase_base, relative=relative)
        if result == 0:
            LOG.debug('blockRebase started successfully',
                      instance=instance)

        while dev.wait_for_job(abort_on_error=True):
            LOG.debug('waiting for blockRebase job completion',
                      instance=instance)
            time.sleep(0.5)
It expects the libvirt block job to remain for some period in a 'cur == end' state with end != 0 (the wait_for_job logic). But in fact, at least with libvirt 1.3.3.2 and libvirt-python-1.2.17, we are not guaranteed to catch the job in that state before it disappears and the libvirt call returns an empty result, which get_job_info() represents as BlockDeviceJobInfo(job=0, bandwidth=0, cur=0, end=0).
Such a result never matches the wait_for_job finish criteria (effective since I45ac06eae0b1949f746dae305469718649bfcf23 was merged).
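The failure mode can be sketched directly from the values quoted above. This assumes the pre-fix finish test required cur == end with end != 0, as described; the names mirror the bug report but the code is illustrative, not nova's literal implementation:

```python
import collections

# Hypothetical stand-in for nova's status object, with the fields
# quoted in the bug report.
BlockDeviceJobInfo = collections.namedtuple(
    'BlockDeviceJobInfo', ['job', 'bandwidth', 'cur', 'end'])


def job_ended(status):
    # The pre-fix finish criterion: progress counters must agree AND
    # end must be non-zero.
    return status.cur == status.end and status.end != 0
```

Once the job vanishes, libvirt's empty result is padded with zeros; cur == end holds but end == 0 fails the test, so the caller never sees the job as finished and loops forever.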
This bug started to occur in our third-party CI:
http://openstack-3rd-party-storage-ci-logs.virtuozzo.com/28/314928/13/check/dsvm-tempest-kvm/5aae7aa
n-cpu.log:
2016-08-17 15:47:04.856 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] volume_snapshot_delete: delete_info: {u'type': u'qcow2', u'merge_target_file': None, u'file_to_merge': None, u'volume_id': u'3e64cef0-03e3-407e-b6c5-fac873a7c98a'} _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2054
2016-08-17 15:47:04.864 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] found device at vda _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2098
2016-08-17 15:47:04.864 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] disk: vda, base: None, bw: 0, relative: False _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2171
2016-08-17 15:47:04.868 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] blockRebase started successfully _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2178
2016-08-17 15:47:04.889 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:05.396 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:05.951 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:06.456 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:06.968 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:07.594 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
BTW, I didn't find any tests checking this in the gate.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1627134/+subscriptions