yahoo-eng-team team mailing list archive

[Bug 1642689] Re: ceph: volume detach fails with "libvirtError: operation failed: disk vdb not found"


Reviewed:  https://review.openstack.org/401375
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d079f37f7ccb8da9e74416cedce4f194b66ee1d0
Submitter: Jenkins
Branch:    master

commit d079f37f7ccb8da9e74416cedce4f194b66ee1d0
Author: int32bit <krystism@xxxxxxxxx>
Date:   Wed Nov 23 18:21:16 2016 +0800

    Fix wait for detach code to handle 'disk not found error'
    
    Currently, the code does an initial detach from the persistent and
    transient domains in one call, then follows up with a call to the
    retryable function, which first checks the domain XML before retrying
    the transient domain detach. If the transient domain detach (which is
    asynchronous) completes after we checked the domain XML, trying to
    detach it again raises a libvirtError exception. Our retry loop in
    `_do_wait_and_retry_detach` doesn't catch it and fails with an error.
    We should handle the 'disk not found' error from libvirt and consider
    the job complete.
    
    Closes-Bug: #1642689
    
    Change-Id: I131aaf28d2f5d5d964d4045e3d7d62207079cfb0
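
For reference, the fix amounts to classifying the libvirtError raised when
the transient-domain detach is retried. Below is a minimal sketch of that
pattern, assuming libvirt-python's error API (get_error_code() /
get_error_message()); the DeviceNotFound marker exception and the helper
name are illustrative, not the exact nova code:

import libvirt


class DeviceNotFound(Exception):
    """Illustrative marker: the device is already gone from the domain."""


def try_detach_device(domain, device_xml, flags=0):
    # Ask libvirt to detach the device. This can race with an earlier,
    # asynchronous live detach that has already removed the disk.
    try:
        domain.detachDeviceFlags(device_xml, flags=flags)
    except libvirt.libvirtError as ex:
        if (ex.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED and
                'not found' in (ex.get_error_message() or '')):
            # "operation failed: disk vdb not found": the earlier async
            # detach completed after the domain XML was last checked, so
            # the device is already detached.
            raise DeviceNotFound()
        raise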


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1642689

Title:
  ceph: volume detach fails with "libvirtError: operation failed: disk
  vdb not found"

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  In Progress

Bug description:
  Seeing this failure in this job:

  tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_snapshot_create_with_volume_in_use[compute,id-b467b54c-07a4-446d-a1cf-651dedcc3ff1]

  While detaching a volume on cleanup it times out waiting for the
  volume status to go from 'in-use' back to 'available':

  2016-11-17 11:58:59.110301 | Captured traceback:
  2016-11-17 11:58:59.110310 | ~~~~~~~~~~~~~~~~~~~
  2016-11-17 11:58:59.110322 |     Traceback (most recent call last):
  2016-11-17 11:58:59.110342 |       File "tempest/common/waiters.py", line 189, in wait_for_volume_status
  2016-11-17 11:58:59.110356 |         raise lib_exc.TimeoutException(message)
  2016-11-17 11:58:59.110373 |     tempest.lib.exceptions.TimeoutException: Request timed out
  2016-11-17 11:58:59.110405 |     Details: Volume db12eda4-4ce6-4f00-a4e0-9df115f230e5 failed to reach available status (current in-use) within the required time (196 s).

  The volume detach request is here:

  2016-11-17 11:58:59.031058 |     2016-11-17 11:38:55,018 8316 INFO     [tempest.lib.common.rest_client] Request (VolumesV1SnapshotTestJSON:_run_cleanups): 202 DELETE http://127.0.0.1:8774/v2.1/servers/584a65b5-07fa-4994-a2d5-1676d0e13a8c/os-volume_attachments/db12eda4-4ce6-4f00-a4e0-9df115f230e5 0.277s
  2016-11-17 11:58:59.031103 |     2016-11-17 11:38:55,018 8316 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
  2016-11-17 11:58:59.031113 |             Body: None
  2016-11-17 11:58:59.031235 |         Response - Headers: {'content-location': 'http://127.0.0.1:8774/v2.1/servers/584a65b5-07fa-4994-a2d5-1676d0e13a8c/os-volume_attachments/db12eda4-4ce6-4f00-a4e0-9df115f230e5', 'content-type': 'application/json', 'x-openstack-nova-api-version': '2.1', 'date': 'Thu, 17 Nov 2016 11:38:55 GMT', 'content-length': '0', 'status': '202', 'connection': 'close', 'x-compute-request-id': 'req-9f0541d3-6eec-4793-8852-7bd01708932e', 'openstack-api-version': 'compute 2.1', 'vary': 'X-OpenStack-Nova-API-Version'}
  2016-11-17 11:58:59.031248 |             Body: 

  Following the req-9f0541d3-6eec-4793-8852-7bd01708932e request ID to
  the compute logs we see this detach failure:

  http://logs.openstack.org/00/398800/1/gate/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/a387fb0/logs/screen-n-cpu.txt.gz?level=TRACE#_2016-11-17_11_39_00_649

  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [req-9f0541d3-6eec-4793-8852-7bd01708932e tempest-VolumesV1SnapshotTestJSON-1819335716 tempest-VolumesV1SnapshotTestJSON-1819335716] [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] Failed to detach volume db12eda4-4ce6-4f00-a4e0-9df115f230e5 from /dev/vdb
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] Traceback (most recent call last):
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/opt/stack/new/nova/nova/compute/manager.py", line 4757, in _driver_detach_volume
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     encryption=encryption)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1307, in detach_volume
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     wait_for_detach()
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 385, in func
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     return evt.wait()
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     return hubs.get_hub().switch()
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     return self.greenlet.switch()
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 136, in _run_loop
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     result = func(*self.args, **self.kw)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 356, in _func
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     result = f(*args, **kwargs)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 380, in _do_wait_and_retry_detach
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     self.detach_device(config, persistent=False, live=live)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 401, in detach_device
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     self._domain.detachDeviceFlags(device_xml, flags=flags)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     result = proxy_call(self._autowrap, f, *args, **kwargs)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     rv = execute(f, *args, **kwargs)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     six.reraise(c, e, tb)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     rv = meth(*args, **kwargs)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]   File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1190, in detachDeviceFlags
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]     if ret == -1: raise libvirtError ('virDomainDetachDeviceFlags() failed', dom=self)
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] libvirtError: operation failed: disk vdb not found
  2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] 

  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Failed%20to%20detach%20volume%5C%22%20AND%20message%3A%5C%22_do_wait_and_retry_detach%5C%22%20AND%20message%3A%5C%22libvirtError%3A%20operation%20failed%3A%20disk%5C%22%20AND%20message%3A%5C%22not%20found%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

  21 hits in the last 7 days, across both the check and gate queues, all
  failures, 90% of them in ceph jobs.

  It looks like the wait-for-detach code should handle the 'disk not
  found' error from libvirt and consider the job complete.
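
  To illustrate "considering the job complete": with oslo.service's
  looping call, a periodic function signals completion by raising
  LoopingCallDone. Here is a hedged sketch under the same assumptions as
  the earlier one (guest, get_disk, and DeviceNotFound are hypothetical
  stand-ins, not the exact nova code):

  from oslo_service import loopingcall


  class DeviceNotFound(Exception):
      """Marker for 'disk not found' raised by the detach helper."""


  def _do_wait_and_retry_detach(guest, dev):
      # Re-read the domain XML; if the disk is gone, the detach finished.
      conf = guest.get_disk(dev)  # hypothetical: returns None once gone
      if conf is None:
          raise loopingcall.LoopingCallDone()
      try:
          guest.detach_device(conf, persistent=False, live=True)
      except DeviceNotFound:
          # The async detach completed between the XML check and the
          # retry; treat 'disk not found' as success, not an error.
          raise loopingcall.LoopingCallDone()

  # Usage (with a hypothetical guest object):
  #   timer = loopingcall.FixedIntervalLoopingCall(
  #       _do_wait_and_retry_detach, guest, 'vdb')
  #   timer.start(interval=2).wait()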

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1642689/+subscriptions

