yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #59654
[Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)
Reviewed: https://review.openstack.org/408668
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b78b1f8ce3aa407307a6adc5c60de1e960547897
Submitter: Jenkins
Branch: master
commit b78b1f8ce3aa407307a6adc5c60de1e960547897
Author: Sean Dague <sean@xxxxxxxxx>
Date: Thu Dec 8 10:09:06 2016 -0500
Bump prlimit cpu time for qemu from 2 to 8
We've got user reported bugs that when opperating with slow NFS
backends with large (30+ GB) disk files, the prlimit of cpu_time 2 is
guessed to be the issue at hand because if folks hot patch a qemu-img
that runs before the prlimitted one, the prlimitted one succeeds.
This increases the allowed cpu timeout, as well as tweaking the error
message so that we return something more prescriptive when the
qemu-img command fails with prlimit abort.
The original bug (#1449062) the main mitigation concern here was a
carefully crafted image that gets qemu-img to generate > 1G of json,
and hence could be a node attack vector. cpu_time was never mentioned,
and I think was added originally as a belt and suspenders addition. As
such, bumping it to 8 seconds shouldn't impact our protection in any
real way.
Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
Closes-Bug: #1646181
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1646181
Title:
NFS: Fail to boot VM out of large snapshots (30GB+)
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) newton series:
Confirmed
Bug description:
Description
===========
Using NFS Shared storage, when I try to boot a VM out of a smaller snapshot (1GB) it works fine.
Although, when i try to do the same out of a larger snapshot (30GB+) it fails regardless of the OpenStack Release Newton or Mitaka.
Steps to reproduce
==================
A chronological list of steps which will bring off the
issue you noticed:
* I have OpenStack RDO MNewton (or Mitaka) installed and functional
* I boot a VM out of a QCOW2 image of about 1GB
* Then I loginto that VM and create a large file (33GB) to inflat the VM image
* then I shutoff the VM and take a snapshot of it that i call "largeVMsnapshotImage"
Alternatively to the steps above,
* I have a snapshot from a large VM (30GB+) that I upload in glance and call "largeVMsnapshotImage"
Then I do:
* then I try to boot a new VM out of that snapshot using the same network
* Although the image seems to be copied to the compute node, the VM Creation fails on "qemu-img info" command
If I run the same command manually, it works:
/usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
image: /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
file format: raw
virtual size: 80G (85899345920 bytes)
disk size: 37G
Although, in the logs it fails and the VM Creation is interrupted, see log from nova-compute.log on the compute node:
...
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] BuildAbortException: Build of instance d6889ea2-f277-40e5-afdc-b3b0698537ed aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885 : Unexpected error while running command.
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Exit code: -9
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stdout: u''
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stderr: u''
...
Expected result
===============
The VM should have been created/booted out of the larg snapshot image.
Actual result
=============
The command fails with exit code -9 when Noiva
Environment
===========
1. Running RDO Newton on Centos 7.2 (or Oracle Linux 7.2) and reproduced on RDO Mitaka as well
If this is from a distro please provide
$ [root@controller ~]# rpm -qa|grep nova
openstack-nova-console-14.0.0-1.el7.noarch
puppet-nova-9.4.0-1.el7.noarch
python-nova-14.0.0-1.el7.noarch
openstack-nova-novncproxy-14.0.0-1.el7.noarch
openstack-nova-conductor-14.0.0-1.el7.noarch
openstack-nova-api-14.0.0-1.el7.noarch
openstack-nova-common-14.0.0-1.el7.noarch
openstack-nova-scheduler-14.0.0-1.el7.noarch
openstack-nova-serialproxy-14.0.0-1.el7.noarch
python2-novaclient-6.0.0-1.el7.noarch
openstack-nova-cert-14.0.0-1.el7.noarch
2. Which hypervisor did you use?
KVM
details:
[root@compute4 nova]# rpm -qa|grep -Ei "kvm|qemu|libvirt"
libvirt-gobject-0.1.9-1.el7.x86_64
libvirt-gconfig-0.1.9-1.el7.x86_64
libvirt-daemon-1.2.17-13.0.1.el7.x86_64
qemu-kvm-common-1.5.3-105.el7.x86_64
qemu-img-1.5.3-105.el7.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
libvirt-client-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-lxc-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-kvm-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-secret-1.2.17-13.0.1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-config-network-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-storage-1.2.17-13.0.1.el7.x86_64
qemu-kvm-1.5.3-105.el7.x86_64
libvirt-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-interface-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-network-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.0.1.el7.x86_64
libvirt-glib-0.1.9-1.el7.x86_64
2. Which storage type did you use?
NFS
From a Sun ZFS or just from a server (both environment would have the same issue)
[root@compute4 nova]# mount | grep -i instance
172.31.254.254:/nova on /var/lib/nova/instances type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.31.0.114,local_lock=none,addr=172.31.254.254)
3. Which networking type did you use?
Neutron with OpenVSwitch
Logs & Configs
==============
From nova-compute.log:
2016-11-30 11:34:50.855 29776 INFO nova.compute.resource_tracker [req-dc5e7f75-5ae5-4929-a7b5-f4fdb95c171e - - - - -] Compute_service record updated for c00b05:c00b05
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Instance failed to spawn
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Traceback (most recent call last):
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2078, in _build_resources
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] yield resources
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1920, in _build_and_run_instance
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] block_device_info=block_device_info)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2571, in spawn
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] admin_pass=admin_password)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2975, in _create_image
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] fallback_from_host)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3075, in _create_and_inject_local_root
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] instance, size, fallback_from_host)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6537, in _try_fetch_image_cache
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] size=size)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 218, in cache
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] *args, **kwargs)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 565, in create_image
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] self.verify_base_size(base, size)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 265, in verify_base_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] base_size = self.get_disk_size(base)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 277, in get_disk_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] return disk.get_disk_size(name)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 148, in get_disk_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] return images.qemu_img_info(path).virtual_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 67, in qemu_img_info
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] raise exception.InvalidDiskInfo(reason=msg)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Exit code: -9
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stdout: u''
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stderr: u''
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]
2016-11-30 11:35:11.005 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
2016-11-30 11:35:11.016 29776 INFO nova.virt.libvirt.driver [-] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] During wait destroy, instance disappeared.
2016-11-30 11:35:11.045 29776 INFO nova.virt.libvirt.driver [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Deleting instance files /var/lib/nova/instances/1e1c2b5a-803a-4701-bc71-254699243f7d_del
2016-11-30 11:35:11.056 29776 INFO nova.virt.libvirt.driver [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Deletion of /var/lib/nova/instances/1e1c2b5a-803a-4701-bc71-254699243f7d_del complete
2016-11-30 11:35:11.194 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.18 seconds to destroy the instance on the hypervisor.
2016-11-30 11:35:11.610 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.42 seconds to deallocate network for instance.
2016-11-30 11:35:11.610 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
2016-11-30 11:35:11.789 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Build of instance 1e1c2b5a-803a-4701-bc71-254699243f7d aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
Exit code: -9
Stdout: u''
Stderr: u''
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Traceback (most recent call last):
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1779, in _do_build_and_run_instance
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] filter_properties)
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1971, in _build_and_run_instance
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] reason=e.format_message())
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] BuildAbortException: Build of instance 1e1c2b5a-803a-4701-bc71-254699243f7d aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Exit code: -9
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stdout: u''
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stderr: u''
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]
2016-11-30 11:35:11.881 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.09 seconds to deallocate network for instance.
2016-11-30 11:35:12.177 29776 INFO nova.compute.manager [-] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] During sync_power_state the instance has a pending task (spawning). Skip.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1646181/+subscriptions
References