yahoo-eng-team team mailing list archive

[Bug 1774249] Re: update_available_resource will raise DiskNotFound after resize but before confirm

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1774249@xxxxxxxxxxxxxxxxxx>
Date: Tue, 21 May 2019 10:56:19 -0000
Reply-to: Bug 1774249 <1774249@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Reviewed:  https://review.opendev.org/571410
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=966192704c20d1b4e9faf384c8dafac8ea6e06ea
Submitter: Zuul
Branch:    master

commit 966192704c20d1b4e9faf384c8dafac8ea6e06ea
Author: jichenjc <jichenjc@xxxxxxxxxx>
Date:   Mon May 21 02:03:51 2018 +0800

    libvirt: Do not reraise DiskNotFound exceptions during resize
    
    When an instance has VERIFY_RESIZE status, the instance disk on the
    source compute host has been moved to the
    <instance_path>/<instance_uuid>_resize folder, which leads to disk not
    found errors if the update available resource periodic task on the
    source compute runs before the resize is actually confirmed.
    
    Icec2769bf42455853cbe686fb30fda73df791b25 almost fixed this issue, but
    it only sets reraise to False when task_state is not None, which isn't
    the case when an instance is resized but resize is not yet confirmed.
    This patch adds a condition based on vm_state to ensure we don't
    reraise DiskNotFound exceptions while resize is not confirmed.
    
    Closes-Bug: 1774249
    Co-Authored-By: Vladyslav Drok <vdrok@xxxxxxxxxxxx>
    Change-Id: Id687e11e235fd6c2f99bb647184310dfdce9a08d
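
The condition described in the commit message can be sketched roughly as
follows. This is an illustration only, not the code from the patch: the
function name should_reraise_disk_not_found and the RESIZED constant are
hypothetical, and the string 'resized' is assumed to be the vm_state that
corresponds to the VERIFY_RESIZE status.

```python
from typing import Optional

# Assumed vm_state value for an instance that has been resized but whose
# resize is not yet confirmed (shown as VERIFY_RESIZE status).
RESIZED = "resized"


def should_reraise_disk_not_found(task_state: Optional[str], vm_state: str) -> bool:
    """Hypothetical helper: decide whether a DiskNotFound error should propagate.

    Returns False when a missing disk is expected and can be tolerated.
    """
    # Earlier behaviour described in the commit message: tolerate the error
    # while some task is in progress on the instance.
    if task_state is not None:
        return False
    # Added condition: task_state is None while the resize awaits confirmation,
    # so the vm_state has to be checked as well.
    if vm_state == RESIZED:
        return False
    return True
```

With only the first check, an instance with task_state None and a resized
vm_state would still cause the error to be reraised, which is the gap the
commit message describes.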


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1774249

Title:
  update_available_resource will raise DiskNotFound after resize but
  before confirm

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ocata series:
  Triaged
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged

Bug description:
  Originally reported in RH Bugzilla:
  https://bugzilla.redhat.com/show_bug.cgi?id=1584315

  Tested on OSP12 (Pike), but appears to be still present on master.
  Should only occur if nova compute is configured to use local file
  instance storage.

  Create instance A on compute X

  Resize instance A to compute Y
    Domain is powered off
    /var/lib/nova/instances/<uuid A> renamed to <uuid A>_resize on X
    Domain is *not* undefined

  On compute X:
    update_available_resource runs as a periodic task
    First action is to update self
    rt calls driver.get_available_resource()
    ...calls _get_disk_over_committed_size_total
    ...iterates over all defined domains, including the ones whose disks we renamed
    ...fails because a referenced disk no longer exists
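
  The sequence above can be imitated with a small standalone script. It is
  a sketch under stated assumptions and does not use nova or libvirt: the
  DiskNotFound class is a local stand-in for nova's exception, and
  allocated_size, total_disk_size and reproduce are hypothetical names. A
  temporary directory plays the role of /var/lib/nova/instances.

```python
import os
import tempfile


class DiskNotFound(Exception):
    """Local stand-in for nova's exception of the same name."""


def allocated_size(path):
    # Fail the same way the traceback does when the disk file is missing.
    if not os.path.exists(path):
        raise DiskNotFound("No disk at %s" % path)
    return os.path.getsize(path)


def total_disk_size(instances_dir, domain_uuids):
    # Iterate over every "defined domain" and sum the size of its disk.
    return sum(
        allocated_size(os.path.join(instances_dir, uuid, "disk"))
        for uuid in domain_uuids
    )


def reproduce():
    with tempfile.TemporaryDirectory() as instances_dir:
        uuid = "uuid-a"  # placeholder for the instance UUID
        os.mkdir(os.path.join(instances_dir, uuid))
        with open(os.path.join(instances_dir, uuid, "disk"), "wb") as f:
            f.write(b"\0" * 1024)

        before = total_disk_size(instances_dir, [uuid])

        # Resize renames the instance directory; the domain stays defined.
        os.rename(
            os.path.join(instances_dir, uuid),
            os.path.join(instances_dir, uuid + "_resize"),
        )

        try:
            total_disk_size(instances_dir, [uuid])
        except DiskNotFound as exc:
            return before, str(exc)
        return before, None
```

  Calling reproduce() returns the size computed before the rename together
  with the error message raised afterwards, since the still-defined domain
  points at a path that no longer exists.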

  Results in errors in nova-compute.log:

      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager [req-bd52371f-c6ec-4a83-9584-c00c5377acd8 - - - - -] Error updating resources for node compute-0.localdomain.: DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager Traceback (most recent call last):
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6695, in update_available_resource_for_node
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 641, in update_available_resource
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5892, in get_available_resource
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7393, in _get_disk_over_committed_size_total
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     config, block_device_info)
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7301, in _get_instance_disk_info_from_config
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     dk_size = disk_api.get_allocated_disk_size(path)
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 156, in get_allocated_disk_size
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     return images.qemu_img_info(path).disk_size
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
      2018-05-30 02:17:08.647 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk

  As a result, the resource tracker is no longer updated. Many
  occurrences of this error can be found in the gate.

  Note that change Icec2769bf42455853cbe686fb30fda73df791b25 nearly
  mitigates this, but does not, because task_state is not set while the
  instance is awaiting confirmation.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1774249/+subscriptions
References

[Bug 1774249] [NEW] update_available_resource will raise DiskNotFound after resize but before confirm
From: Matthew Booth, 2018-05-30