← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)

 

Reviewed:  https://review.openstack.org/408668
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b78b1f8ce3aa407307a6adc5c60de1e960547897
Submitter: Jenkins
Branch:    master

commit b78b1f8ce3aa407307a6adc5c60de1e960547897
Author: Sean Dague <sean@xxxxxxxxx>
Date:   Thu Dec 8 10:09:06 2016 -0500

    Bump prlimit cpu time for qemu from 2 to 8
    
    We've got user reported bugs that when opperating with slow NFS
    backends with large (30+ GB) disk files, the prlimit of cpu_time 2 is
    guessed to be the issue at hand because if folks hot patch a qemu-img
    that runs before the prlimitted one, the prlimitted one succeeds.
    
    This increases the allowed cpu timeout, as well as tweaking the error
    message so that we return something more prescriptive when the
    qemu-img command fails with prlimit abort.
    
    The original bug (#1449062) the main mitigation concern here was a
    carefully crafted image that gets qemu-img to generate > 1G of json,
    and hence could be a node attack vector. cpu_time was never mentioned,
    and I think was added originally as a belt and suspenders addition. As
    such, bumping it to 8 seconds shouldn't impact our protection in any
    real way.
    
    Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
    Closes-Bug: #1646181


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1646181

Title:
  NFS: Fail to boot VM out of large snapshots (30GB+)

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Confirmed

Bug description:
  Description
  ===========
  Using NFS Shared storage, when I try to boot a VM out of a smaller snapshot (1GB) it works fine.
  Although, when i try to do the same out of a larger snapshot (30GB+) it fails regardless of the OpenStack Release Newton or Mitaka.

  Steps to reproduce
  ==================
  A chronological list of steps which will bring off the
  issue you noticed:
  * I have OpenStack RDO MNewton (or Mitaka) installed and functional
  * I boot a VM out of a QCOW2 image of about 1GB
  * Then I loginto that VM and create a large file (33GB) to inflat the VM image
  * then I shutoff the VM and take a snapshot of it that i call "largeVMsnapshotImage"

  Alternatively to the steps above,
  * I have a snapshot from a large VM (30GB+) that I upload in glance and call "largeVMsnapshotImage"

  Then I do:
  * then I try to boot a new VM out of that snapshot using the same network
  * Although the image seems to be copied to the compute node, the VM Creation fails on "qemu-img info" command

  If I run the same command manually, it works:
  /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  image: /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  file format: raw
  virtual size: 80G (85899345920 bytes)
  disk size: 37G

  Although, in the logs it fails and the VM Creation is interrupted, see log from nova-compute.log on the compute node:
  ...
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] BuildAbortException: Build of instance d6889ea2-f277-40e5-afdc-b3b0698537ed aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885 : Unexpected error while running command.
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Exit code: -9
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stdout: u''
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stderr: u''
  ...

  
  Expected result
  ===============
  The VM should have been created/booted out of the larg snapshot image.

  Actual result
  =============
  The command fails with exit code -9 when Noiva

  Environment
  ===========
  1. Running RDO Newton on Centos 7.2 (or Oracle Linux 7.2) and reproduced on RDO Mitaka as well

     If this is from a distro please provide
         $ [root@controller ~]# rpm -qa|grep nova
  openstack-nova-console-14.0.0-1.el7.noarch
  puppet-nova-9.4.0-1.el7.noarch
  python-nova-14.0.0-1.el7.noarch
  openstack-nova-novncproxy-14.0.0-1.el7.noarch
  openstack-nova-conductor-14.0.0-1.el7.noarch
  openstack-nova-api-14.0.0-1.el7.noarch
  openstack-nova-common-14.0.0-1.el7.noarch
  openstack-nova-scheduler-14.0.0-1.el7.noarch
  openstack-nova-serialproxy-14.0.0-1.el7.noarch
  python2-novaclient-6.0.0-1.el7.noarch
  openstack-nova-cert-14.0.0-1.el7.noarch

  
  2. Which hypervisor did you use?
     KVM
     
     details:
     [root@compute4 nova]# rpm -qa|grep -Ei "kvm|qemu|libvirt"
  libvirt-gobject-0.1.9-1.el7.x86_64
  libvirt-gconfig-0.1.9-1.el7.x86_64
  libvirt-daemon-1.2.17-13.0.1.el7.x86_64
  qemu-kvm-common-1.5.3-105.el7.x86_64
  qemu-img-1.5.3-105.el7.x86_64
  ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
  libvirt-client-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-nodedev-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-lxc-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-kvm-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-secret-1.2.17-13.0.1.el7.x86_64
  libvirt-python-1.2.17-2.el7.x86_64
  libvirt-daemon-config-network-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-config-nwfilter-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-storage-1.2.17-13.0.1.el7.x86_64
  qemu-kvm-1.5.3-105.el7.x86_64
  libvirt-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-interface-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-network-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-nwfilter-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-qemu-1.2.17-13.0.1.el7.x86_64
  libvirt-glib-0.1.9-1.el7.x86_64

  
  2. Which storage type did you use?
     NFS
     From a Sun ZFS or just from a server (both environment would have the same issue)
     [root@compute4 nova]# mount | grep -i instance
  172.31.254.254:/nova on /var/lib/nova/instances type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.31.0.114,local_lock=none,addr=172.31.254.254)


  3. Which networking type did you use?
     Neutron with OpenVSwitch

  Logs & Configs
  ==============

  From nova-compute.log:

  2016-11-30 11:34:50.855 29776 INFO nova.compute.resource_tracker [req-dc5e7f75-5ae5-4929-a7b5-f4fdb95c171e - - - - -] Compute_service record updated for c00b05:c00b05
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Instance failed to spawn
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Traceback (most recent call last):
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2078, in _build_resources
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     yield resources
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1920, in _build_and_run_instance
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     block_device_info=block_device_info)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2571, in spawn
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     admin_pass=admin_password)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2975, in _create_image
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     fallback_from_host)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3075, in _create_and_inject_local_root
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     instance, size, fallback_from_host)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6537, in _try_fetch_image_cache
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     size=size)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 218, in cache
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     *args, **kwargs)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 565, in create_image
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     self.verify_base_size(base, size)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 265, in verify_base_size
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     base_size = self.get_disk_size(base)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 277, in get_disk_size
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     return disk.get_disk_size(name)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 148, in get_disk_size
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     return images.qemu_img_info(path).virtual_size
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 67, in qemu_img_info
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     raise exception.InvalidDiskInfo(reason=msg)
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Exit code: -9
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stdout: u''
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stderr: u''
  2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]
  2016-11-30 11:35:11.005 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
  2016-11-30 11:35:11.016 29776 INFO nova.virt.libvirt.driver [-] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] During wait destroy, instance disappeared.
  2016-11-30 11:35:11.045 29776 INFO nova.virt.libvirt.driver [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Deleting instance files /var/lib/nova/instances/1e1c2b5a-803a-4701-bc71-254699243f7d_del
  2016-11-30 11:35:11.056 29776 INFO nova.virt.libvirt.driver [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Deletion of /var/lib/nova/instances/1e1c2b5a-803a-4701-bc71-254699243f7d_del complete
  2016-11-30 11:35:11.194 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.18 seconds to destroy the instance on the hypervisor.
  2016-11-30 11:35:11.610 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.42 seconds to deallocate network for instance.
  2016-11-30 11:35:11.610 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
  2016-11-30 11:35:11.789 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Build of instance 1e1c2b5a-803a-4701-bc71-254699243f7d aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
  Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
  Exit code: -9
  Stdout: u''
  Stderr: u''
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Traceback (most recent call last):
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1779, in _do_build_and_run_instance
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     filter_properties)
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1971, in _build_and_run_instance
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]     reason=e.format_message())
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] BuildAbortException: Build of instance 1e1c2b5a-803a-4701-bc71-254699243f7d aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Exit code: -9
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stdout: u''
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stderr: u''
  2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]
  2016-11-30 11:35:11.881 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.09 seconds to deallocate network for instance.
  2016-11-30 11:35:12.177 29776 INFO nova.compute.manager [-] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] During sync_power_state the instance has a pending task (spawning). Skip.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1646181/+subscriptions


References