[Bug 1627134] Re: libvirt driver stuck deleting online snapshot

 

Reviewed:  https://review.openstack.org/378746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0f4bd241665c287e49f2d30ca79be96298217b7e
Submitter: Jenkins
Branch:    master

commit 0f4bd241665c287e49f2d30ca79be96298217b7e
Author: Matthew Booth <mbooth@xxxxxxxxxx>
Date:   Wed Sep 28 16:44:41 2016 +0100

    libvirt: Fix BlockDevice.wait_for_job when qemu reports no job
    
    We were misinterpreting the return value of blockJobInfo. Most
    immediately we were expecting it to return an integer, which has never
    been the case. blockJobInfo also raises an exception on error. Note
    that the implementation of abort_on_error has always expected an
    integer return value, and exceptions have never been handled, which
    means that abort_on_error has always been a no-op, and exceptions have
    never been swallowed. As this is also the most intuitive behaviour, we
    make it explicit by removing abort_on_error. Any exception raised by
    blockJobInfo will continue to propagate unhandled.
    
    We were obfuscating the return value indicating that the job did not
    exist, {}, by populating a BlockDeviceJobInfo with fake values. We
    de-obfuscate this by returning None instead, which is unambiguous.
    
    wait_for_job() was misnamed, as it does not wait. This is renamed to
    is_job_complete() to be less confusing. Note that the logic is
    reversed.
    
    After discussion with Eric Blake of the libvirt team (see review
    comments: https://review.openstack.org/#/c/375652/), we are now
    confident asserting that if no job exists then it has completed
    (although we are still not sure that it succeeded). Consequently we
    remove the wait_for_job_clean parameter, and always assume that no job
    means it has completed. Previously this was implicit because no job
    meant a defaulted BlockDeviceJobInfo.job value of 0.
    
    Co-authored-by: Sławek Kapłoński <slawek@xxxxxxxxxxxx>
    Closes-Bug: #1627134
    Change-Id: I2d0daa32b1d37fa60412ad7a374ee38cebdeb579
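
In effect, the reworked helpers behave roughly as in the sketch below. This is illustrative only, derived from the commit message above rather than from the committed diff itself; the standalone functions and the end != 0 guard are simplified stand-ins for the actual Guest/BlockDevice methods:

    import collections

    # Simplified stand-in for nova's BlockDeviceJobInfo.
    BlockDeviceJobInfo = collections.namedtuple(
        'BlockDeviceJobInfo', ['job', 'bandwidth', 'cur', 'end'])

    def get_job_info(block_job_info):
        # block_job_info is the dict returned by virDomain.blockJobInfo();
        # it is {} when qemu reports no job, and any libvirtError raised by
        # the call simply propagates now that abort_on_error is gone.
        if not block_job_info:
            return None
        return BlockDeviceJobInfo(job=block_job_info.get('type', 0),
                                  bandwidth=block_job_info.get('bandwidth', 0),
                                  cur=block_job_info.get('cur', 0),
                                  end=block_job_info.get('end', 0))

    def is_job_complete(block_job_info):
        info = get_job_info(block_job_info)
        if info is None:
            # No job reported: assume it has completed, though it may not
            # have succeeded.
            return True
        # end == 0 would mean the job has not made progress yet (assumption
        # in this sketch), so only report completion once cur has reached a
        # non-zero end.
        return info.end != 0 and info.cur == info.end

    # Callers then poll the "completed" state instead of the old
    # "still running" one, e.g.:
    #     while not dev.is_job_complete():
    #         time.sleep(0.5)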


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1627134

Title:
  libvirt driver stuck deleting online snapshot

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Confirmed

Bug description:
  There is a problem in the nova code in nova/virt/libvirt/driver.py:
              dev = guest.get_block_device(rebase_disk)
              if guest.is_active():
                  result = dev.rebase(rebase_base, relative=relative)
                  if result == 0:
                      LOG.debug('blockRebase started successfully',
                                instance=instance)

                  while dev.wait_for_job(abort_on_error=True):
                      LOG.debug('waiting for blockRebase job completion',
                                instance=instance)
                      time.sleep(0.5)

  It expects the libvirt block job to stay for some period in a 'cur == end' state with end != 0 (the wait_for_job finish criterion). In fact, at least with libvirt 1.3.3.2 and libvirt-python 1.2.17, we are not guaranteed to catch the job in that state before it disappears and the libvirt call returns an empty result, which get_job_info() represents as BlockDeviceJobInfo(job=0, bandwidth=0, cur=0, end=0).
  Such a result never matches the wait_for_job finish criterion (in effect since I45ac06eae0b1949f746dae305469718649bfcf23 was merged), so the loop above spins forever.
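
  To make the failure mode concrete, here is a minimal sketch. It is illustrative only and mirrors the finish criterion described above rather than quoting the old nova code:

      def old_job_finished(block_job_info):
          # block_job_info is the dict from virDomain.blockJobInfo(); it is
          # {} once the job has vanished, which get_job_info() turned into
          # BlockDeviceJobInfo(job=0, bandwidth=0, cur=0, end=0).
          cur = block_job_info.get('cur', 0)
          end = block_job_info.get('end', 0)
          return cur == end and end != 0

      # wait_for_job() returns "still running", so the driver loops while it
      # is True; an empty reply therefore keeps the loop going forever:
      assert old_job_finished({}) is False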

  
  This bug started to occur in our third-party CI:
  http://openstack-3rd-party-storage-ci-logs.virtuozzo.com/28/314928/13/check/dsvm-tempest-kvm/5aae7aa

  n-cpu.log:
  2016-08-17 15:47:04.856 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] volume_snapshot_delete: delete_info: {u'type': u'qcow2', u'merge_target_file': None, u'file_to_merge': None, u'volume_id': u'3e64cef0-03e3-407e-b6c5-fac873a7c98a'} _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2054
  2016-08-17 15:47:04.864 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] found device at vda _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2098
  2016-08-17 15:47:04.864 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] disk: vda, base: None, bw: 0, relative: False _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2171
  2016-08-17 15:47:04.868 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] blockRebase started successfully _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2178
  2016-08-17 15:47:04.889 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
  2016-08-17 15:47:05.396 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
  2016-08-17 15:47:05.951 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
  2016-08-17 15:47:06.456 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
  2016-08-17 15:47:06.968 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
  2016-08-17 15:47:07.594 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182

  BTW, I didn't find any tests checking this in the gate.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1627134/+subscriptions

