
yahoo-eng-team team mailing list archive

[Bug 1824974] [NEW] Nova Compute Manager (Resource update) fails if a disk is missing

 

Public bug reported:

===Description===

We recently ran into an issue with the periodic resource update on a KVM
hypervisor.

If, for some reason, a disk is missing or unreadable, the periodic
resource update fails with:

2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager [req-a9ee69b3-6e90-4137-8a6a-c4a4a4426b5a 05fafdddb8f7495fbe162e748fe3f63f 13a3d2b57166496e86e9d25ec8967869 - - -] Error updating resources for node ock00043i2.frn00006.cni.ukcloud.com. 
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager Traceback (most recent call last):
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6460, in update_available_resource_for_node
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     rt.update_available_resource(context)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 511, in update_available_resource
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     resources = self.driver.get_available_resource(self.nodename)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5599, in get_available_resource
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7192, in _get_disk_over_committed_size_total
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     block_device_info=block_device_info)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7101, in _get_instance_disk_info
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     dk_size = disk_api.get_allocated_disk_size(path)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 158, in get_allocated_disk_size
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     return images.qemu_img_info(path).disk_size
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/7af164a3-2eb9-4cb3-9a81-3fd57bf087e9/disk
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager

This error is expected, since the disk really is missing. The problem
is that the periodic resource update in nova.compute.manager then
fails to update ANY resource information for this hypervisor, which
leaves its stats stale.

It would be better if the error were logged while the other stats
(CPU and memory) were still updated.
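One possible shape for that behavior, sketched under assumptions: the traceback shows a single DiskNotFound escaping from _get_instance_disk_info inside _get_disk_over_committed_size_total, so one option is for the per-instance loop to catch it, log it, and carry on. The function disk_over_committed_total, the get_sizes hook and the local DiskNotFound class below are hypothetical stand-ins, not Nova's actual code or API.

```python
import logging

LOG = logging.getLogger(__name__)


class DiskNotFound(Exception):
    """Hypothetical local stand-in for nova.exception.DiskNotFound."""

    def __init__(self, location):
        super().__init__("No disk at %s" % location)
        self.location = location


def disk_over_committed_total(instances, get_sizes):
    """Sum over-committed bytes across instances, skipping missing disks.

    instances: dict mapping instance UUID -> disk path.
    get_sizes: hypothetical callable returning (virtual_size, allocated_size)
               for a path, raising DiskNotFound if the disk is absent.
    Returns (total_bytes, list_of_uuids_whose_disk_was_missing).
    """
    total = 0
    missing = []
    for uuid, path in instances.items():
        try:
            virtual_size, allocated_size = get_sizes(path)
        except DiskNotFound as exc:
            # Log and continue instead of aborting the whole periodic update.
            LOG.warning("Skipping instance %s: %s", uuid, exc)
            missing.append(uuid)
            continue
        total += max(virtual_size - allocated_size, 0)
    return total, missing


if __name__ == "__main__":
    sizes = {"/inst/a/disk": (10, 4), "/inst/c/disk": (8, 8)}

    def fake_sizes(path):
        # Hypothetical lookup table; instance "b" has no disk.
        if path not in sizes:
            raise DiskNotFound(path)
        return sizes[path]

    paths = {"a": "/inst/a/disk", "b": "/inst/b/disk", "c": "/inst/c/disk"}
    print(*disk_over_committed_total(paths, fake_sizes))  # prints: 6 ['b']
```

The idea is that one missing disk costs the accuracy of a single disk figure instead of blocking CPU and memory reporting for the whole node; how skipped instances should count toward disk usage is left open here.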

Steps to reproduce 
===================

1. Boot an instance on a hypervisor.
2. Shut the instance down.
3. Rename the folder that contains the instance's disk.
4. Watch the nova-compute logs for the error above.
5. Build additional instances on the hypervisor.
6. Check whether the hypervisor's stats are updated.
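Step 3 can be rehearsed outside a real hypervisor. The sketch below builds a throwaway <instances_path>/<uuid>/disk layout in a temporary directory (modelled on the path in the traceback), renames the instance folder, and re-checks the disk path. It assumes the lookup that raises DiskNotFound amounts to a file-existence check; check_disk and the local DiskNotFound class are hypothetical and do not call Nova or qemu-img.

```python
import os
import tempfile
import uuid


class DiskNotFound(Exception):
    """Hypothetical stand-in for the DiskNotFound error seen in the traceback."""


def check_disk(path):
    """Raise DiskNotFound if the disk file is absent (assumed existence check)."""
    if not os.path.exists(path):
        raise DiskNotFound("No disk at %s" % path)
    return path


def simulate_renamed_instance_dir(instances_path):
    """Create <instances_path>/<uuid>/disk, rename its folder, then re-check it.

    Returns the DiskNotFound message after the rename, or None if the
    disk is unexpectedly still found.
    """
    inst_dir = os.path.join(instances_path, str(uuid.uuid4()))
    os.makedirs(inst_dir)
    disk = os.path.join(inst_dir, "disk")
    with open(disk, "wb") as f:
        f.write(b"\0" * 512)  # placeholder bytes, not a real disk image
    check_disk(disk)  # the disk is present before the rename
    os.rename(inst_dir, inst_dir + ".moved")  # step 3: rename the folder
    try:
        check_disk(disk)
    except DiskNotFound as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_renamed_instance_dir(tmp))
```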

Expected results
=================
CPU and memory stats should still be updated; disk stats could perhaps be left unchanged or marked as stale.
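A minimal sketch of that expected behavior, assuming a simple per-node snapshot object: CPU and memory are always refreshed, while a failed disk read keeps the previous disk figure and sets a disk_stale flag. NodeStats, refresh_stats and the three reader callables are hypothetical names for illustration; they are not part of Nova's resource tracker.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

LOG = logging.getLogger(__name__)


@dataclass
class NodeStats:
    """Hypothetical per-hypervisor snapshot, not Nova's ComputeNode object."""
    vcpus_used: int = 0
    memory_mb_used: int = 0
    disk_gb_used: Optional[int] = None
    disk_stale: bool = False


def refresh_stats(previous: NodeStats,
                  read_cpu: Callable[[], int],
                  read_memory: Callable[[], int],
                  read_disk: Callable[[], int]) -> NodeStats:
    """Always refresh CPU and memory; keep the last disk value if disk fails."""
    stats = NodeStats(vcpus_used=read_cpu(), memory_mb_used=read_memory())
    try:
        stats.disk_gb_used = read_disk()
    except Exception:  # broad on purpose in this sketch; narrow it in real code
        LOG.exception("Disk stats unavailable; keeping previous value")
        stats.disk_gb_used = previous.disk_gb_used
        stats.disk_stale = True
    return stats
```

Keeping the last known disk value, rather than zeroing it, avoids the scheduler suddenly seeing free disk that may not exist; the stale flag lets callers decide how much to trust it.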

Actual Results
=================
No stats are updated for the hypervisor.

Environment
===================
OpenStack Newton. Judging by the latest code, I think this is still an issue in the latest release.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1824974

Title:
  Nova Compute Manager (Resource update) fails if a disk is missing

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1824974/+subscriptions