yahoo-eng-team team mailing list archive

[Bug 1660647] Re: _cleanup_failed_start aggressively removes local instance files when handling plug_vif failures

 

Reviewed:  https://review.openstack.org/427267
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=67aa277b4ef623c9877b97bfd7952f0bb1d80a81
Submitter: Jenkins
Branch:    master

commit 67aa277b4ef623c9877b97bfd7952f0bb1d80a81
Author: Lee Yarwood <lyarwood@xxxxxxxxxx>
Date:   Tue Jan 31 15:26:38 2017 +0000

    libvirt: Limit destroying disks during cleanup to spawn
    
    Iab5afdf1b5b now ensures that cleanup is always called when VIF plugging
    errors are encountered by _create_domain_and_network. At present cleanup
    is always called with destroy_disks=True leading to any local instance
    files being removed from the host.
    
    _create_domain_and_network itself has various callers such as resume and
    hard_reboot that assume these files will persist across any such
    failures. As a result the removal of these files will leave instances in
    an unbootable state.
    
    In order to correct this an additional destroy_disks_on_failures kwarg
    is now provided to _create_domain_and_network and passed down into
    cleanup. This kwarg defaults to False and is only enabled when
    _create_domain_and_network is used to spawn a new instance.
    
    Closes-bug: #1660647
    Change-Id: I38c969690fedb71c5b5ec4418c1b0dd53df733ec
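
The behaviour described in the commit message can be sketched roughly as
follows. This is a toy model rather than nova code: SketchDriver,
disk_survives and the local exception class are hypothetical stand-ins,
the kwarg name is copied from the commit message, and the real driver
methods take many more arguments.

```python
import os
import shutil
import tempfile


class VirtualInterfaceCreateException(Exception):
    """Hypothetical stand-in for the nova exception of the same name."""


class SketchDriver:
    """Toy model of the failure handling; not the real LibvirtDriver."""

    def __init__(self, instance_dir):
        self.instance_dir = instance_dir

    def cleanup(self, destroy_disks):
        # Local instance files are only removed when explicitly requested.
        if destroy_disks and os.path.isdir(self.instance_dir):
            shutil.rmtree(self.instance_dir)

    def _create_domain_and_network(self, destroy_disks_on_failures=False):
        try:
            # Simulate a VIF plugging failure on every call.
            raise VirtualInterfaceCreateException()
        except VirtualInterfaceCreateException:
            # Cleanup always runs, but disks go only if the caller opted in.
            self.cleanup(destroy_disks=destroy_disks_on_failures)
            raise

    def spawn(self):
        # A new instance has nothing worth keeping, so opt in to removal.
        self._create_domain_and_network(destroy_disks_on_failures=True)

    def hard_reboot(self):
        # Existing instances rely on the default and keep their files.
        self._create_domain_and_network()


def disk_survives(operation):
    """Run the named operation and report whether the disk file remains."""
    instance_dir = tempfile.mkdtemp()
    disk = os.path.join(instance_dir, "disk")
    with open(disk, "w") as handle:
        handle.write("data")
    driver = SketchDriver(instance_dir)
    try:
        getattr(driver, operation)()
    except VirtualInterfaceCreateException:
        pass
    survived = os.path.exists(disk)
    shutil.rmtree(instance_dir, ignore_errors=True)
    return survived
```

In this sketch disk_survives("hard_reboot") keeps the disk file while
disk_survives("spawn") removes it, which mirrors the split the commit
message describes between spawn and the other callers.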


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1660647

Title:
  _cleanup_failed_start aggressively removes local instance files when
  handling plug_vif failures

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  Iab5afdf1b5b8d107ea0e5895c24d50712e7dc7b1 [1] ensured that
  _cleanup_failed_start is always called if we encounter VIF plugging
  failures in _create_domain_and_network. However, this currently leads
  to any local instance files being removed, as cleanup is called with
  destroy_disks=True.

  As such any failures when resuming or restarting an instance will lead
  to these files being removed and the instance left in an unbootable
  state. IMHO these files should only be removed when cleaning up after
  errors hit while initially spawning an instance.

  Steps to reproduce
  ==================
  - Boot an instance using local disks
  - Stop the instance
  - Start the instance, causing a timeout or other failure during plug_vifs
  - Attempt to start the instance again

  Expected result
  ===============
  The local instance files are left on the host if instances are rebooting or resuming.

  Actual result
  =============
  The local instance files are removed from the host if _cleanup_failed_start is called.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
     list for all releases: http://docs.openstack.org/releases/

     $ pwd
     /opt/stack/nova
     $ git rev-parse HEAD
     42222969a21ee28ef4a68bd5ab1ec8a12c4ad126

  
  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?

     Libvirt + KVM

  3. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?

     N/A

  4. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)

     N/A

  Logs & Configs
  ==============

  $ nova boot --image cirros-0.3.4-x86_64-uec --flavor 1 test-boot
  [..]
  $ nova stop test-boot
  $ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
  total 18596
  -rw-rw-r--. 1 root  root        16699 Jan 31 09:30 console.log
  -rw-r--r--. 1 root  root     10289152 Jan 31 09:30 disk
  -rw-r--r--. 1 stack libvirtd      257 Jan 31 09:29 disk.info
  -rw-rw-r--. 1 qemu  qemu      4979632 Jan 31 09:29 kernel
  -rw-rw-r--. 1 qemu  qemu      3740163 Jan 31 09:29 ramdisk

  I used the following change to artificially reproduce a VIF plugging
  failure:

  $ git diff
  diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
  index 33e3157..248e960 100644
  --- a/nova/virt/libvirt/driver.py
  +++ b/nova/virt/libvirt/driver.py
  @@ -5015,6 +5015,7 @@ class LibvirtDriver(driver.ComputeDriver):
           pause = bool(events)
           guest = None
           try:
  +            raise exception.VirtualInterfaceCreateException()
               with self.virtapi.wait_for_instance_event(
                       instance, events, deadline=timeout,
                       error_callback=self._neutron_failed_callback):

  $ nova start test-boot
  Request to start server test-boot has been accepted.
  $ nova list
  +--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
  | ID                                   | Name      | Status  | Task State | Power State | Networks                       |
  +--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
  | be6cb386-e005-4fb2-8332-7e0c375ee452 | test-boot | SHUTOFF | -          | Shutdown    | public=172.24.4.8, 2001:db8::9 |
  +--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
  $ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
  ls: cannot access '../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/': No such file or directory

  Subsequent attempts to start the instance then fail as a result:

  $ nova start test-boot
  Request to start server test-boot has been accepted.
  $ vi ../logs/n-cpu.log
  [..]
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server Traceback (most recent call last):
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 75, in wrapped
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     function_name, call_dict, binary)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 66, in wrapped
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 188, in decorated_function
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     LOG.warning(msg, e, instance=instance)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 157, in decorated_function
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/utils.py", line 685, in decorated_function
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 216, in decorated_function
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 204, in decorated_function
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2524, in start_instance
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self._power_on(context, instance)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2494, in _power_on
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     block_device_info)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2494, in power_on
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self._hard_reboot(context, instance, network_info, block_device_info)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2373, in _hard_reboot
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     block_device_info)
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 6931, in _get_instance_disk_info
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     dk_size = int(os.path.getsize(path))
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib64/python2.7/genericpath.py", line 57, in getsize
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return os.stat(filename).st_size
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server OSError: [Errno 2] No such file or directory: '/opt/stack/data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/disk'
  2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server
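
  The final frames of the traceback come down to os.path.getsize being
  called on a disk file that no longer exists. A minimal standalone
  sketch of that failure mode, using a temporary directory in place of
  the real instance directory (missing_disk_errno is a hypothetical
  helper, not part of nova):

```python
import errno
import os
import tempfile


def missing_disk_errno():
    """Return the errno raised when sizing a disk whose directory is gone."""
    instance_dir = tempfile.mkdtemp()
    disk = os.path.join(instance_dir, "disk")
    # Remove the directory, as the overly aggressive cleanup did.
    os.rmdir(instance_dir)
    try:
        os.path.getsize(disk)
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    # ENOENT corresponds to "[Errno 2] No such file or directory".
    print(missing_disk_errno() == errno.ENOENT)
```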

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1660647/+subscriptions

