yahoo-eng-team team mailing list archive

[Bug 1421550] Re: Creating VM image fails under the race condition with detaching volume

 

Reviewed:  https://review.openstack.org/383859
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4aa39c44a4b08ee4e05548d5c258e795089b2bdd
Submitter: Jenkins
Branch:    master

commit 4aa39c44a4b08ee4e05548d5c258e795089b2bdd
Author: Matthew Booth <mbooth@xxxxxxxxxx>
Date:   Fri Oct 7 19:14:38 2016 +0100

    libvirt: Fix races with nfs volume mount/umount
    
    A single nfs export typically contains multiple volumes. We were
    handling this in the libvirt driver by:
    
    1. On mount, we 'ensure' the mount is available, so we don't fail if
       another instance already has it mounted.
    
    2. On umount, we trap and ignore 'device is busy' so we don't fail if
       another instance is already using it.
    
    Unfortunately, while this works for serial mounts and unmounts, there
    are multiple failure cases when volumes from the same export are
    mounted and unmounted simultaneously. One such case occurs when an
    instance is stopped: because its qemu process is not actively using
    the mountpoint, it will not prevent an unmount for another volume on
    the same mountpoint from succeeding. It will then not be possible to
    restart the instance, because its mountpoint will not be mounted.
    
    To fix this, we create a singleton manager object, which tracks mounts
    and umount requests per export, and calls the real mount/umount only
    when required. It uses per-export locks to allow concurrency while
    avoiding races. Because we now expect to know the state of the host at
    all times, we no longer need to execute speculative mount/umount
    commands.
    
    As we track attachments (a mapping from volume to instance) rather
    than volumes, we also gracefully support multi-attach.
    
    This change implements this for nfs, but the solution is intended to
    be extended to all LibvirtBaseFileSystemVolumeDrivers.
    
    Closes-Bug: #1421550
    Change-Id: I3155984d76df06371a6c45f633aa448168a96d64
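
The approach described in the commit message can be illustrated with a minimal
sketch. This is not the code from the nova change: the class name MountManager,
the do_mount/do_umount callables, and the export, volume, and instance strings
are all hypothetical, chosen only to show per-export locking combined with
attachment tracking.

```python
import threading


class MountManager:
    """Toy tracker of mounts per export (hypothetical, not nova's class)."""

    def __init__(self, do_mount, do_umount):
        # do_mount/do_umount are assumed callables that perform the real
        # mount and umount for a given export.
        self._do_mount = do_mount
        self._do_umount = do_umount
        self._guard = threading.Lock()  # protects the _state dict itself
        self._state = {}  # export -> (lock, set of (volume, instance))

    def _state_for(self, export):
        # Create the per-export lock and attachment set on first use.
        with self._guard:
            if export not in self._state:
                self._state[export] = (threading.Lock(), set())
            return self._state[export]

    def mount(self, export, volume, instance):
        lock, attachments = self._state_for(export)
        with lock:
            # Only the first attachment on an export triggers a real mount.
            if not attachments:
                self._do_mount(export)
            attachments.add((volume, instance))

    def umount(self, export, volume, instance):
        lock, attachments = self._state_for(export)
        with lock:
            if (volume, instance) not in attachments:
                return  # unknown attachment: nothing to do
            attachments.remove((volume, instance))
            # Only the last attachment on an export triggers a real umount.
            if not attachments:
                self._do_umount(export)


# Usage: record the "real" calls instead of touching the host.
calls = []
manager = MountManager(lambda e: calls.append(("mount", e)),
                       lambda e: calls.append(("umount", e)))
manager.mount("server:/export", "volume-1", "instance-a")
manager.mount("server:/export", "volume-2", "instance-b")
manager.umount("server:/export", "volume-2", "instance-b")  # still in use
manager.umount("server:/export", "volume-1", "instance-a")  # last user gone
```

Because the decision to unmount depends on tracked attachments rather than on
whether a qemu process currently holds the mountpoint open, a stopped instance
keeps its export mounted in this sketch. Operations on different exports take
different locks and so do not block each other.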


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1421550

Title:
  Creating VM image fails under the race condition with detaching volume

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Environment:
  nova 2014.2.1
  cinder 2014.2.1
  Ubuntu 14.04 LTS
  Cinder volumes with an NFS backend are used.

  There are two 'ACTIVE' VM instances on the same compute node.
  Creating a VM image (VM snapshot) fails when it races with detaching a volume from the other VM instance.
  During image creation, restarting the VM instance fails (it remains in the 'SHUTOFF' state) and the VM image is deleted.

  nova-compute's log is as follows:
  ---------------------------------------------------------------------------------------------------------------
  2015-01-26 10:28:47,000.744 11535 ERROR nova.virt.libvirt.driver [req-7ee6f579-63f5-4822-a2b3-10e53bb1dce0 None] Error launching a defined domain with XML: <domain type='kvm'>
  (snipped...)
  2015-01-26 10:28:47,000.767 11535 DEBUG nova.compute.manager [req-7ee6f579-63f5-4822-a2b3-10e53bb1dce0 None] [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0] Cleaning up image e8c3255c-b4e8-4324-addb-365c5d7b1868 decorated_function /usr/lib/python2.7/dist-packages/nova/compute/manager.py:373
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0] Traceback (most recent call last):
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 369, in decorated_function
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     *args, **kwargs)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3027, in snapshot_instance
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     task_states.IMAGE_SNAPSHOT)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3058, in _snapshot_instance
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     update_task_state)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1733, in snapshot
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     new_dom = self._create_domain(domain=virt_dom)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4338, in _create_domain
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     LOG.error(err)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     six.reraise(self.type_, self.value, self.tb)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4329, in _create_domain
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     domain.createWithFlags(launch_flags)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     result = proxy_call(self._autowrap, f, *args, **kwargs)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     rv = execute(f, *args, **kwargs)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     six.reraise(c, e, tb)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     rv = meth(*args, **kwargs)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 896, in createWithFlags
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0] libvirtError: Failed to open file '/var/lib/nova/mnt/339b9d35866664794a8155657a049127/volume-27f59813-76f0-4b56-ba42-75922537c36c': No such file or directory
  2015-01-26 10:28:47,000.767 11535 TRACE nova.compute.manager [instance: d613cfc9-109a-4920-bbc1-41ce4146ace0] 
  ---------------------------------------------------------------------------------------------------------------

  When a volume is detached, the NFS directory is unmounted without checking whether the other VM instance still has volumes attached from it.
  So if the other VM is stopped, unmounting the NFS directory succeeds.

  If no processes are using the NFS directory, it is unmounted.
  nova/virt/libvirt/volumes.py(2014.2.1):
  ---------------------------------------------------------------------------------------------------------------
  class LibvirtNFSVolumeDriver(LibvirtBaseVolumeDriver):
  (snipped...)
      def disconnect_volume(self, connection_info, disk_dev):
          """Disconnect the volume."""

          export = connection_info['data']['export']
          mount_path = os.path.join(CONF.libvirt.nfs_mount_point_base,
                                    utils.get_hash_str(export))

          try:
              utils.execute('umount', mount_path, run_as_root=True)
          except processutils.ProcessExecutionError as exc:
              if ('device is busy' in exc.message or
                  'target is busy' in exc.message):
                  LOG.debug("The NFS share %s is still in use.", export)
              else:
                  LOG.exception(_LE("Couldn't unmount the NFS share %s"), export)
  ---------------------------------------------------------------------------------------------------------------
  * This code was added in https://review.openstack.org/#/c/76558/.

  Creating a VM image temporarily stops the VM instance.
  While it is stopped, a volume is detached from the other VM instance on the same compute node.
  If no running VM is using a Cinder volume on the NFS directory, unmounting it succeeds.
  After the VM snapshot completes, the VM instance is restarted.
  But the VM instance cannot access its volumes because the NFS directory has been unmounted.
  So the error occurs and the VM instance cannot be restarted.

  This issue also occurs when starting a VM instance races with
  detaching volumes from another ('ACTIVE') VM instance on the same compute node:

  1. _connect_volume runs while starting a VM instance (mounts the NFS directory if not mounted).
  2. _disconnect_volume runs while detaching a volume (unmounts the NFS directory if no processes use it).
  3. The libvirt domain of the starting VM instance is launched.
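
The three-step interleaving above can be reproduced with a toy model. This is
only a sketch under simplifying assumptions: ToyHost and its methods are
hypothetical names, not nova or libvirt APIs, and the "busy" check is reduced
to a set of processes that hold the mountpoint open.

```python
class ToyHost:
    """Toy compute host with one NFS mountpoint shared by all instances."""

    def __init__(self):
        self.mounted = False
        self.users = set()  # instances whose qemu process uses the mountpoint

    def connect_volume(self):
        # Step 1: mount the NFS directory if it is not mounted yet.
        self.mounted = True

    def disconnect_volume(self):
        # Step 2: umount is refused only while some process uses the
        # mountpoint ("device is busy"); otherwise it succeeds.
        if not self.users:
            self.mounted = False

    def start_domain(self, instance):
        # Step 3: the domain can only start if its volume file is reachable.
        if not self.mounted:
            raise FileNotFoundError("volume file missing: export not mounted")
        self.users.add(instance)


# Racy order: 1 (connect), 2 (disconnect for the other VM), 3 (start).
host = ToyHost()
host.connect_volume()
host.disconnect_volume()  # nothing uses the mountpoint yet, so it goes away
try:
    host.start_domain("vm-a")
    start_failed = False
except FileNotFoundError:
    start_failed = True
```

In this model the start fails in the racy order, while the serial order
(connect, start, then disconnect for the other VM) leaves the export mounted
because the running domain keeps the mountpoint busy.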

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1421550/+subscriptions

