yahoo-eng-team team mailing list archive

[Bug 1450594] Re: Instance deletion fails sometimes when serial_console is enabled

 

** Changed in: nova/kilo
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1450594

Title:
  Instance deletion fails sometimes when serial_console is enabled

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) juno series:
  New
Status in OpenStack Compute (nova) kilo series:
  Fix Released

Bug description:
  Nova Version:  2014.2.1

  When nova-compute retries an instance delete after the original
  delete failed, and the serial console feature is enabled, the retry
  fails with:

  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1179, in cleanup
  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]     for host, port in self._get_serial_ports_from_instance(instance):
  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1197, in _get_serial_ports_from_instance
  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]     virt_dom = self._lookup_by_name(instance['name'])
  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4195, in _lookup_by_name
  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0]     raise exception.InstanceNotFound(instance_id=instance_name)
  2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] InstanceNotFound: Instance instance-00000444 could not be found.

  Put another way, the _get_serial_ports_from_instance call should
  probably not raise an exception when the instance cannot be found.
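
  As a minimal sketch of that idea (not the actual nova code): FakeDriver
  and its in-memory domain table are hypothetical stand-ins, and the host
  and port values are made up. Only the method names mirror the trace
  above. The lookup failure is treated as "no serial ports to release".

```python
class InstanceNotFound(Exception):
    """Stand-in for nova's InstanceNotFound exception."""


class FakeDriver:
    """Hypothetical, simplified stand-in for the libvirt driver."""

    def __init__(self, domains):
        # Maps domain name -> list of (host, port) serial console endpoints.
        self._domains = domains

    def _lookup_by_name(self, name):
        try:
            return self._domains[name]
        except KeyError:
            raise InstanceNotFound("Instance %s could not be found." % name)

    def _get_serial_ports_from_instance(self, instance):
        try:
            ports = self._lookup_by_name(instance["name"])
        except InstanceNotFound:
            # The domain is already undefined, so there is nothing to release.
            return
        for host, port in ports:
            yield host, port
```

  With this shape, a caller looping over the generator simply sees an
  empty sequence for an undefined domain and carries on with the rest of
  the cleanup.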

  More details/context:

  In our particular situation, some instance deletes initially fail
  because the neutron port delete operation fails or times out.  The VM
  then goes to 'error' and remains in the deleting task_state.  However,
  because the failure is in the port delete, the domain has already been
  undefined in libvirt: the first invocation of _delete_instance calls
  shutdown_instance before any attempt is made to delete the network.
  shutdown_instance successfully calls driver.destroy, which shuts down
  the instance and then runs the cleanup action, ignoring any errors
  around vif removal.  This undefines the domain as long as it was
  successfully shut down.

  The next time nova-compute is started, it finds the instance still in
  the deleting task state, so it retries the delete.  Part of the
  cleanup call run by driver.destroy is to remove the serial console.
  Note: this had already run and succeeded during the first delete, when
  the domain was undefined.  But because the domain is no longer defined
  in libvirt, the _get_serial_ports_from_instance call fails, and again
  the entire delete operation fails and stops.  This makes it impossible
  to fully delete the instance.
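
  The sequence described above can be modelled with a small simulation.
  This is only a sketch under stated assumptions: FakeCompute,
  PortDeleteFailed, neutron_reachable and run_scenario are hypothetical
  names invented for illustration, and the real nova code paths are far
  more involved.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova's InstanceNotFound exception."""


class PortDeleteFailed(Exception):
    """Hypothetical stand-in for a failed or timed-out neutron port delete."""


class FakeCompute:
    """Toy model of the delete flow; not the real nova-compute manager."""

    def __init__(self, serial_console_enabled):
        self.serial_console_enabled = serial_console_enabled
        self.defined_domains = {"instance-00000444"}
        self.neutron_reachable = False
        self.task_state = None

    def _destroy_and_cleanup(self, name):
        # With the serial console enabled, cleanup looks the domain up to
        # find its serial ports, which fails once the domain is undefined.
        if self.serial_console_enabled and name not in self.defined_domains:
            raise InstanceNotFound(name)
        self.defined_domains.discard(name)  # undefine the domain

    def delete_instance(self, name):
        self.task_state = "deleting"
        self._destroy_and_cleanup(name)
        if not self.neutron_reachable:
            raise PortDeleteFailed(name)
        self.task_state = "deleted"


def attempt(compute, name):
    """Run one delete attempt and report how it ended."""
    try:
        compute.delete_instance(name)
    except (InstanceNotFound, PortDeleteFailed) as exc:
        return type(exc).__name__
    return "ok"


def run_scenario(serial_console_enabled):
    compute = FakeCompute(serial_console_enabled)
    first = attempt(compute, "instance-00000444")
    compute.neutron_reachable = True  # neutron recovers before the retry
    second = attempt(compute, "instance-00000444")
    return first, second, compute.task_state
```

  In this model, run_scenario(True) leaves the instance stuck in the
  deleting task state after the retry, while run_scenario(False) lets
  the retry finish.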

  When the serial console feature is disabled, this delete re-try
  operation functions correctly and properly cleans up the rest of the
  instance, and it transitions to deleted.

  FWIW, we are also running nova-cells, so the neutron --> nova port
  notifications do not work/are disabled.  Don't know if that's relevant
  or not.


  Steps to reproduce:

  - nova-compute configured with serial console feature enabled
  - Create an instance which has a serial console configured
  - Delete that instance, but cause the neutron port delete to fail or time out (via iptables or just shutting off neutron temporarily)
  - The instance should now be stuck in the deleting task state
  - Restart nova-compute
  - During the re-try of the delete operation, the above stack trace results.


  Expected result:

  Retries of instance deletions in this scenario should succeed with the
  same behavior that happens when the serial console feature is
  disabled.


  Proposed Fix:

  Under:
  https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L761-L765
  shortly above this, create a variable called isdefined and set it to
  true.  Where we check whether the domain is defined, set isdefined to
  false if it is not.

  Under:
  https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L848-L851
  add a test to see if isdefined is false and if it is, do not attempt
  to get the serial console for the nonexistent domain.
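
  A rough sketch of the flag-based approach, assuming a simplified
  cleanup function. The function signature, the dict of domains and the
  released_ports list are hypothetical and exist only to illustrate the
  control flow; this is not the code at the links above.

```python
def cleanup(domains, name, serial_console_enabled, released_ports):
    """Toy cleanup step.

    domains maps a domain name to its list of (host, port) serial
    console endpoints; released_ports collects what was released.
    """
    is_defined = True
    if name not in domains:
        # The domain was already undefined by an earlier delete attempt.
        is_defined = False

    # Only look up serial ports for a domain that still exists.
    if serial_console_enabled and is_defined:
        for host, port in domains[name]:
            released_ports.append((host, port))

    if is_defined:
        del domains[name]  # undefine the domain
    return is_defined
```

  Calling cleanup twice for the same name releases the ports once and
  turns the second call into a no-op for the serial console, which is
  the behaviour the retry needs.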

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1450594/+subscriptions
