yahoo-eng-team team mailing list archive

[Bug 1450594] [NEW] Instance deletion fails sometimes when serial_console is enabled

 

Public bug reported:

Nova Version:  2014.2.1

For situations where nova-compute is re-trying an instance delete after
the original delete failed, and the serial console feature is enabled,
the instance delete fails with:

2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1179, in cleanup
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]     for host, port in self._get_serial_ports_from_instance(instance):
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1197, in _get_serial_ports_from_instance
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]     virt_dom = self._lookup_by_name(instance['name'])
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4195, in _lookup_by_name
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]     raise exception.InstanceNotFound(instance_id=instance_name)
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] InstanceNotFound: Instance instance-00000444 could not be found.

Put another way, the _get_serial_ports_from_instance call should
probably not raise an exception when the instance cannot be found.
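
To illustrate that suggestion, here is a minimal sketch of a lookup that
treats a missing domain as "no serial ports" instead of failing.  This
is not the nova code: FakeDriver, get_serial_ports_or_empty and the
local InstanceNotFound class are hypothetical stand-ins, and the real
driver gets its ports from the libvirt domain XML rather than a dict.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova.exception.InstanceNotFound (assumption)."""


class FakeDriver(object):
    """Hypothetical, heavily simplified model of the libvirt driver."""

    def __init__(self, domains):
        # Maps domain name -> list of (host, port) serial console endpoints.
        self._domains = domains

    def _lookup_by_name(self, name):
        # Mirrors the behaviour in the trace: unknown domains raise.
        try:
            return self._domains[name]
        except KeyError:
            raise InstanceNotFound("Instance %s could not be found." % name)

    def get_serial_ports_or_empty(self, name):
        """Return the serial ports, or an empty list if the domain is gone."""
        try:
            return list(self._lookup_by_name(name))
        except InstanceNotFound:
            # Nothing left to release; let the rest of cleanup continue.
            return []


driver = FakeDriver({"instance-00000001": [("127.0.0.1", 10000)]})
ports_present = driver.get_serial_ports_or_empty("instance-00000001")
ports_missing = driver.get_serial_ports_or_empty("instance-00000444")
```

With this shape, a cleanup loop over the returned ports simply does
nothing when the domain has already been undefined.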

More details/context:

In our particular situation, some instance deletes initially fail
because the neutron port delete operation fails or times out.  The VM
then goes to 'error' and remains in the deleting task_state.  However,
since the failure is in the port delete, the domain has already been
undefined in libvirt.  The first invocation of _delete_instance calls
shutdown_instance before any attempt is made to delete the network.
shutdown_instance successfully calls driver.destroy, which shuts down
the instance and then runs the cleanup action, ignoring any errors
around vif removal.  This undefines the domain as long as it was
successfully shut down.

The next time nova-compute is started, it finds the instance still in
the deleting task state, so it retries the delete.  Part of the cleanup
call run by driver.destroy is to remove the serial console.  Note: this
step already ran and succeeded during the first delete, when the domain
was successfully undefined.  But since the domain is no longer defined
in libvirt, the _get_serial_ports_from_instance call fails, and the
entire delete operation again fails and stops.  This makes it
impossible to fully delete the instance.
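
The sequence described above can be modelled with a small toy, shown
here only to make the ordering of events concrete.  ToyCompute, its
attributes and PortDeleteFailed are all hypothetical names; the real
code paths in nova.compute.manager and the libvirt driver are far more
involved.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova.exception.InstanceNotFound (assumption)."""


class PortDeleteFailed(Exception):
    """Hypothetical error for a failed or timed-out neutron port delete."""


class ToyCompute(object):
    """Toy model of the delete flow; not the nova implementation."""

    def __init__(self, serial_console_enabled):
        self.serial_console_enabled = serial_console_enabled
        self.domain_defined = True
        self.neutron_up = False
        self.deleted = False

    def _cleanup(self):
        if self.serial_console_enabled:
            # Looking up serial ports requires the domain to still exist.
            if not self.domain_defined:
                raise InstanceNotFound("domain is no longer defined")
        # Destroy and cleanup end with the domain being undefined.
        self.domain_defined = False

    def delete_instance(self):
        # The guest is shut down and cleaned up before the network delete.
        self._cleanup()
        if not self.neutron_up:
            raise PortDeleteFailed("port delete failed or timed out")
        self.deleted = True


def run(serial_console_enabled):
    """Attempt the delete twice; neutron recovers before the retry."""
    compute = ToyCompute(serial_console_enabled)
    outcomes = []
    for _ in range(2):
        try:
            compute.delete_instance()
            outcomes.append("deleted")
        except (InstanceNotFound, PortDeleteFailed) as exc:
            outcomes.append(type(exc).__name__)
        compute.neutron_up = True
    return outcomes


with_console = run(True)
without_console = run(False)
```

In this model the retry only fails when the serial console check is
enabled, which matches the behaviour reported here.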

When the serial console feature is disabled, this delete re-try
operation functions correctly and properly cleans up the rest of the
instance, and it transitions to deleted.

FWIW, we are also running nova-cells, so the neutron --> nova port
notifications do not work/are disabled.  Don't know if that's relevant
or not.


Steps to reproduce:

- nova-compute configured with serial console feature enabled
- Create an instance which has a serial console configured
- Delete that instance, but cause the neutron port delete to fail or time out (via iptables or just shutting off neutron temporarily)
- The instance should now be stuck in the deleting task state
- Restart nova-compute
- During the re-try of the delete operation, the above stack trace results.


Expected result:

Retries of instance deletions in this scenario should succeed with the
same behavior that happens when the serial console feature is disabled.


Proposed Fix:

Under:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L761-L765
Shortly above this, create a variable called isdefined and set it to
true.  Where we check whether the domain is defined, set isdefined to
false if it is not.

Under:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L848-L851
add a test to see if isdefined is false and if it is, do not attempt to
get the serial console for the nonexistent domain.
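
A rough sketch of how that flag could be threaded through, under the
assumption that it is returned from the destroy step and passed to the
cleanup step.  ToyDriver and its methods are hypothetical and only
mirror the shape of the proposal; only the name isdefined comes from
the text above.

```python
class ToyDriver(object):
    """Toy model of the proposed isdefined guard; not the nova driver."""

    def __init__(self, domains):
        # Maps domain name -> list of (host, port) serial console endpoints.
        self._domains = dict(domains)
        self.released_ports = []

    def _destroy(self, name):
        # Start from True and flip the flag if the domain is not defined.
        isdefined = True
        if name not in self._domains:
            isdefined = False
        return isdefined

    def _cleanup(self, name, isdefined):
        # Only touch the serial console when the domain still exists.
        if isdefined:
            self.released_ports.extend(self._domains.pop(name))

    def destroy(self, name):
        isdefined = self._destroy(name)
        self._cleanup(name, isdefined)
        return isdefined


driver = ToyDriver({"instance-00000001": [("127.0.0.1", 10000)]})
first_attempt = driver.destroy("instance-00000001")
retry_attempt = driver.destroy("instance-00000001")
```

The retry returns without raising because the serial console lookup is
skipped once the domain is known to be undefined.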

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1450594

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1450594/+subscriptions