← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1660647] [NEW] _cleanup_failed_start aggressively removes local instance files when handling plug_vif failures

 

Public bug reported:

Description
===========
Iab5afdf1b5b8d107ea0e5895c24d50712e7dc7b1 [1] ensured that _cleanup_failed_start is always called if we encounter VIF plugging failures in _create_domain_and_network. However this currently leads to any local instance files being removed as cleanup is called with destroy_disks=True.

As such any failures when resuming or restarting an instance will lead
to these files being removed and the instance left in an unbootable
state. IMHO these files should only be removed when cleaning up after
errors hit while initially spawning an instance.

Steps to reproduce
==================
- Boot an instance using local disks
- Stop the instance
- Start the instance, causing a timeout or other failure during plug_vifs
- Attempt to start the instance again

Expected result
===============
The local instance files are left on the host if instances are rebooting or resuming.

Actual result
=============
The local instance files are removed from the host if _cleanup_failed_start is called.

Environment
===========
1. Exact version of OpenStack you are running. See the following
   list for all releases: http://docs.openstack.org/releases/

   $ pwd
   /opt/stack/nova
   $ git rev-parse HEAD
   42222969a21ee28ef4a68bd5ab1ec8a12c4ad126


2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

2. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

3. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

$ nova boot --image cirros-0.3.4-x86_64-uec --flavor 1 test-boot
[..]
$ nova stop test-boot
$ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
total 18596
-rw-rw-r--. 1 root  root        16699 Jan 31 09:30 console.log
-rw-r--r--. 1 root  root     10289152 Jan 31 09:30 disk
-rw-r--r--. 1 stack libvirtd      257 Jan 31 09:29 disk.info
-rw-rw-r--. 1 qemu  qemu      4979632 Jan 31 09:29 kernel
-rw-rw-r--. 1 qemu  qemu      3740163 Jan 31 09:29 ramdisk

I used the following change to artificially recreate an issue plugging
the VIFs :

$ git diff
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 33e3157..248e960 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -5015,6 +5015,7 @@ class LibvirtDriver(driver.ComputeDriver):
         pause = bool(events)
         guest = None
         try:
+            raise exception.VirtualInterfaceCreateException()
             with self.virtapi.wait_for_instance_event(
                     instance, events, deadline=timeout,
                     error_callback=self._neutron_failed_callback):

$ nova start test-boot
Request to start server test-boot has been accepted.
$ nova list
+--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
| ID                                   | Name      | Status  | Task State | Power State | Networks                       |
+--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
| be6cb386-e005-4fb2-8332-7e0c375ee452 | test-boot | SHUTOFF | -          | Shutdown    | public=172.24.4.8, 2001:db8::9 |
+--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
$ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
ls: cannot access '../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/': No such file or directory

Future attempts to start the instance will fail as a result :

$ nova start test-boot
Request to start server test-boot has been accepted.
$ vi ../logs/n-cpu.log
[..]
5353 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server Traceback (most recent call last):
5354 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
5355 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
5356 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
5357 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
5358 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
5359 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     result = func(ctxt, **new_args)
5360 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 75, in wrapped
5361 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     function_name, call_dict, binary)
5362 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
5363 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
5364 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
5365 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
5366 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 66, in wrapped
5367 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return f(self, context, *args, **kw)
5368 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 188, in decorated_function
5369 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     LOG.warning(msg, e, instance=instance)
5370 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
5371 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
5372 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
5373 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
5374 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 157, in decorated_function
5375 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
5376 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/utils.py", line 685, in decorated_function
5377 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
5378 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 216, in decorated_function
5379 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
5380 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
5381 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
5382 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
5383 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
5384 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 204, in decorated_function
5385 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
5386 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2524, in start_instance
5387 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self._power_on(context, instance)
5388 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2494, in _power_on
5389 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     block_device_info)  
5390 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2494, in power_on
5391 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self._hard_reboot(context, instance, network_info, block_device_info)
5392 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2373, in _hard_reboot
5393 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     block_device_info)  
5394 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 6931, in _get_instance_disk_info
5395 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     dk_size = int(os.path.getsize(path))
5396 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib64/python2.7/genericpath.py", line 57, in getsize
5397 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return os.stat(filename).st_size
5398 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server OSError: [Errno 2] No such file or directory: '/opt/stack/data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/disk'
5399 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1660647

Title:
  _cleanup_failed_start aggressively removes local instance files when
  handling plug_vif failures

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========
  Iab5afdf1b5b8d107ea0e5895c24d50712e7dc7b1 [1] ensured that _cleanup_failed_start is always called if we encounter VIF plugging failures in _create_domain_and_network. However this currently leads to any local instance files being removed as cleanup is called with destroy_disks=True.

  As such any failures when resuming or restarting an instance will lead
  to these files being removed and the instance left in an unbootable
  state. IMHO these files should only be removed when cleaning up after
  errors hit while initially spawning an instance.

  Steps to reproduce
  ==================
  - Boot an instance using local disks
  - Stop the instance
  - Start the instance, causing a timeout or other failure during plug_vifs
  - Attempt to start the instance again

  Expected result
  ===============
  The local instance files are left on the host if instances are rebooting or resuming.

  Actual result
  =============
  The local instance files are removed from the host if _cleanup_failed_start is called.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
     list for all releases: http://docs.openstack.org/releases/

     $ pwd
     /opt/stack/nova
     $ git rev-parse HEAD
     42222969a21ee28ef4a68bd5ab1ec8a12c4ad126

  
  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?

     Libvirt + KVM

  2. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?

     N/A

  3. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)

     N/A

  Logs & Configs
  ==============

  $ nova boot --image cirros-0.3.4-x86_64-uec --flavor 1 test-boot
  [..]
  $ nova stop test-boot
  $ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
  total 18596
  -rw-rw-r--. 1 root  root        16699 Jan 31 09:30 console.log
  -rw-r--r--. 1 root  root     10289152 Jan 31 09:30 disk
  -rw-r--r--. 1 stack libvirtd      257 Jan 31 09:29 disk.info
  -rw-rw-r--. 1 qemu  qemu      4979632 Jan 31 09:29 kernel
  -rw-rw-r--. 1 qemu  qemu      3740163 Jan 31 09:29 ramdisk

  I used the following change to artificially recreate an issue plugging
  the VIFs :

  $ git diff
  diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
  index 33e3157..248e960 100644
  --- a/nova/virt/libvirt/driver.py
  +++ b/nova/virt/libvirt/driver.py
  @@ -5015,6 +5015,7 @@ class LibvirtDriver(driver.ComputeDriver):
           pause = bool(events)
           guest = None
           try:
  +            raise exception.VirtualInterfaceCreateException()
               with self.virtapi.wait_for_instance_event(
                       instance, events, deadline=timeout,
                       error_callback=self._neutron_failed_callback):

  $ nova start test-boot
  Request to start server test-boot has been accepted.
  $ nova list
  +--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
  | ID                                   | Name      | Status  | Task State | Power State | Networks                       |
  +--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
  | be6cb386-e005-4fb2-8332-7e0c375ee452 | test-boot | SHUTOFF | -          | Shutdown    | public=172.24.4.8, 2001:db8::9 |
  +--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
  $ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
  ls: cannot access '../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/': No such file or directory

  Future attempts to start the instance will fail as a result :

  $ nova start test-boot
  Request to start server test-boot has been accepted.
  $ vi ../logs/n-cpu.log
  [..]
  5353 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server Traceback (most recent call last):
  5354 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
  5355 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  5356 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
  5357 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  5358 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
  5359 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  5360 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 75, in wrapped
  5361 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     function_name, call_dict, binary)
  5362 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  5363 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
  5364 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  5365 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  5366 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/exception_wrapper.py", line 66, in wrapped
  5367 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  5368 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 188, in decorated_function
  5369 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     LOG.warning(msg, e, instance=instance)
  5370 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  5371 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
  5372 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  5373 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  5374 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 157, in decorated_function
  5375 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  5376 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/utils.py", line 685, in decorated_function
  5377 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  5378 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 216, in decorated_function
  5379 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
  5380 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  5381 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self.force_reraise()
  5382 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  5383 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  5384 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 204, in decorated_function
  5385 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
  5386 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2524, in start_instance
  5387 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self._power_on(context, instance)
  5388 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2494, in _power_on
  5389 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     block_device_info)  
  5390 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2494, in power_on
  5391 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     self._hard_reboot(context, instance, network_info, block_device_info)
  5392 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2373, in _hard_reboot
  5393 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     block_device_info)  
  5394 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 6931, in _get_instance_disk_info
  5395 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     dk_size = int(os.path.getsize(path))
  5396 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server   File "/usr/lib64/python2.7/genericpath.py", line 57, in getsize
  5397 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server     return os.stat(filename).st_size
  5398 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server OSError: [Errno 2] No such file or directory: '/opt/stack/data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/disk'
  5399 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1660647/+subscriptions


Follow ups