
yahoo-eng-team team mailing list archive

[Bug 1662867] Re: update_available_resource_for_node racing instance deletion

 

We still see this in CI runs periodically:

http://logs.openstack.org/70/552170/2/check/legacy-tempest-dsvm-multinode-live-migration/474e6b5/logs/screen-n-cpu.txt.gz?level=INFO#_Mar_13_22_24_03_425580

We can see that in this case, the instance in question has its files
deleted right before the libvirt driver, via the
update_available_resource periodic task, gets to processing that
instance:

Mar 13 22:24:03.367217 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: INFO nova.virt.libvirt.driver [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] [instance: 0460cf87-16f1-4aa3-a964-7cab159327dc] Deleting instance files /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc_del
Mar 13 22:24:03.367725 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: INFO nova.virt.libvirt.driver [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] [instance: 0460cf87-16f1-4aa3-a964-7cab159327dc] Deletion of /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc_del complete
Mar 13 22:24:03.425580 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] Error updating resources for node ubuntu-xenial-rax-iad-0002937850.: InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk : Unexpected error while running command.
Mar 13 22:24:03.425969 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk --force-share
Mar 13 22:24:03.426340 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Exit code: 1
Mar 13 22:24:03.426705 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Stdout: u''
Mar 13 22:24:03.427044 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Stderr: u"qemu-img: Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': No such file or directory\n"
Mar 13 22:24:03.427380 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Traceback (most recent call last):
Mar 13 22:24:03.427724 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/compute/manager.py", line 7320, in update_available_resource_for_node
Mar 13 22:24:03.428084 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
Mar 13 22:24:03.428410 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 664, in update_available_resource
Mar 13 22:24:03.428653 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
Mar 13 22:24:03.428898 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6432, in get_available_resource
Mar 13 22:24:03.429219 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
Mar 13 22:24:03.429576 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7977, in _get_disk_over_committed_size_total
Mar 13 22:24:03.429850 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     config, block_device_info)
Mar 13 22:24:03.430176 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7896, in _get_instance_disk_info_from_config
Mar 13 22:24:03.430498 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     virt_size = disk_api.get_disk_size(path)
Mar 13 22:24:03.430768 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/virt/disk/api.py", line 140, in get_disk_size
Mar 13 22:24:03.431081 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     return images.qemu_img_info(path).virtual_size
Mar 13 22:24:03.431316 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager   File "/opt/stack/new/nova/nova/virt/images.py", line 82, in qemu_img_info
Mar 13 22:24:03.431546 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager     raise exception.InvalidDiskInfo(reason=msg)
Mar 13 22:24:03.431785 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk : Unexpected error while running command.
Mar 13 22:24:03.432017 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk --force-share
Mar 13 22:24:03.432266 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Exit code: 1
Mar 13 22:24:03.432498 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Stdout: u''
Mar 13 22:24:03.432742 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Stderr: u"qemu-img: Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': No such file or directory\n"
Mar 13 22:24:03.432981 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager 

We should likely just catch the ProcessExecutionError, make a best-
effort check of the exit code (1) and the error message for "No such
file or directory", and if that's the case, skip that instance rather
than failing the entire process of building the current set of
resources for the host.
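The handling described above could be sketched roughly as follows. This is a hypothetical helper, not nova's actual code: the real ProcessExecutionError lives in oslo_concurrency.processutils and is stubbed here so the sketch is self-contained, and disk_size_or_skip is an invented name.

```python
# Stand-in for oslo_concurrency.processutils.ProcessExecutionError so the
# sketch is self-contained; nova would catch the real class instead.
class ProcessExecutionError(Exception):
    def __init__(self, exit_code=None, stderr=''):
        super().__init__(stderr)
        self.exit_code = exit_code
        self.stderr = stderr


def disk_size_or_skip(get_disk_size, path):
    """Return the disk's virtual size, or None if its file vanished."""
    try:
        return get_disk_size(path)
    except ProcessExecutionError as exc:
        # Best effort: qemu-img exits 1 and reports ENOENT when a concurrent
        # delete removed the file between listing the domain and probing it.
        if exc.exit_code == 1 and 'No such file or directory' in (exc.stderr or ''):
            return None  # caller skips this instance instead of aborting
        raise  # any other failure is still fatal to the resource update
```

A caller in the disk-accounting loop would then continue past a None result rather than letting the exception abort the whole resource update for the host.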

I thought about adding the deleted=False filter to this query:

https://github.com/openstack/nova/blob/96fdce7cab7c50736fb96bcaa622dac825031a2f/nova/virt/libvirt/driver.py#L7932

But at that point, libvirt has already told us the domain still exists
on the host, so filtering by deleted=False doesn't really help: the
instance isn't deleted yet. It's still on the host according to the
hypervisor, but it's in the process of being deleted, and its files are
already cleaned up by the time we hit the error.
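A toy model of why the filter can't close the race (all names here are hypothetical, not nova's object model): the DB row is not yet marked deleted when the periodic task runs, so a deleted=False filter still returns the instance even though its disk files are already gone.

```python
import os
import tempfile


class FakeInstance:
    """Stand-in for an instance record; not nova's real object model."""
    def __init__(self, uuid, deleted, disk_path):
        self.uuid = uuid
        self.deleted = deleted
        self.disk_path = disk_path


with tempfile.TemporaryDirectory() as tmp:
    # The disk file is never created, modelling files already removed
    # by the concurrent delete.
    disk_path = os.path.join(tmp, 'disk')
    inst = FakeInstance('0460cf87', deleted=False, disk_path=disk_path)

    # The deleted=False filter still passes the instance through...
    survivors = [i for i in [inst] if not i.deleted]
    assert survivors == [inst]
    # ...but probing its disk would still hit "No such file or directory".
    assert not os.path.exists(inst.disk_path)
```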

** Changed in: nova
       Status: Expired => Triaged

** Changed in: nova
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1662867

Title:
  update_available_resource_for_node racing instance deletion

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Description
  ===========
  The following trace was seen multiple times during a CI run for https://review.openstack.org/#/c/383859/ :

  http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?level=ERROR#_2017-02-07_19_10_25_548
  http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?level=ERROR#_2017-02-07_19_15_26_004

  In the first example a request to terminate the instance 60b7cb32
  appears to race an existing run of the
  update_available_resource_for_node periodic task:

  req-fa96477b-34d2-4ab6-83bf-24c269ed7c28

  http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?#_2017-02-07_19_10_25_478

  req-dc60ed89-d3da-45f6-b98c-8f57c767d751

  http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?#_2017-02-07_19_10_25_548

  Steps to reproduce
  ==================
  Delete an instance while update_available_resource_for_node is running

  Expected result
  ===============
  Either swallow the exception and move on or lock instances in such a way that they can't be removed while this periodic task is running.

  Actual result
  =============
  update_available_resource_for_node fails and stops.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
    list for all releases: http://docs.openstack.org/releases/

     https://review.openstack.org/#/c/383859/ - but it should reproduce
  against master.

  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?

     Libvirt

  3. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?

     n/a

  4. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)

     n/a

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1662867/+subscriptions
