yahoo-eng-team team mailing list archive
Message #35109
[Bug 1450594] Re: Instance deletion fails sometimes when serial_console is enabled
** Tags added: kilo-backport-potential
** Tags added: juno-backport-potential
** Also affects: nova/juno
Importance: Undecided
Status: New
** Also affects: nova/kilo
Importance: Undecided
Status: New
** Changed in: nova
Importance: Low => Medium
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1450594
Title:
Instance deletion fails sometimes when serial_console is enabled
Status in OpenStack Compute (Nova):
In Progress
Status in OpenStack Compute (nova) juno series:
New
Status in OpenStack Compute (nova) kilo series:
New
Bug description:
Nova Version: 2014.2.1
For situations where nova-compute is re-trying an instance delete
after the original delete failed, and the serial console feature is
enabled, the instance delete fails with:
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1179, in cleanup
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] for host, port in self._get_serial_ports_from_instance(instance):
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1197, in _get_serial_ports_from_instance
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] virt_dom = self._lookup_by_name(instance['name'])
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4195, in _lookup_by_name
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] raise exception.InstanceNotFound(instance_id=instance_name)
2015-04-27 16:54:49.900 114127 TRACE nova.compute.manager [instance: 6d117169-4057-4a4a-a0b7-0b12e996caa0] InstanceNotFound: Instance instance-00000444 could not be found.
Put another way, the _get_serial_ports_from_instance call probably
should not raise an exception if the instance cannot be found.
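A minimal, self-contained sketch of that idea follows. The classes here (InstanceNotFound, FakeDriver) are hypothetical stand-ins that only borrow the method names from the trace; this is not nova's actual code. The assumption is that a missing domain can be treated as "no serial ports left to release" instead of a fatal error.

```python
class InstanceNotFound(Exception):
    """Hypothetical stand-in for nova's InstanceNotFound exception."""


class FakeDriver(object):
    """Hypothetical stand-in for the libvirt driver; not nova's real class."""

    def __init__(self, domains):
        # Maps domain name -> list of (host, port) serial console endpoints.
        self._domains = domains
        self.released = []

    def _lookup_by_name(self, name):
        try:
            return self._domains[name]
        except KeyError:
            raise InstanceNotFound("Instance %s could not be found." % name)

    def _get_serial_ports_from_instance(self, instance):
        try:
            ports = self._lookup_by_name(instance['name'])
        except InstanceNotFound:
            # Domain is already gone: nothing to release, so yield nothing.
            return
        for host, port in ports:
            yield host, port

    def cleanup(self, instance):
        # Record each port that would be released back to the pool.
        for host, port in self._get_serial_ports_from_instance(instance):
            self.released.append((host, port))
```

With this shape, calling cleanup for a domain that no longer exists simply releases nothing and lets the rest of the delete proceed.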
More details/context:
In our particular situation, some instance deletes are initially
failing because the neutron port delete operation was failing or
timing out. So the VM goes to 'error' and remains in the deleting
task_state. However, since the failure is on the port delete, the
domain has already been undefined in libvirt. The first invocation of
_delete_instance calls shutdown_instance before an attempt is made to
delete the network. The shutdown_instance call is able to successfully
call driver.destroy, which shuts down the instance and then runs the
cleanup action, ignoring any errors around vif removal. This will
undefine the domain as long as it was successfully shut down.
The next time nova-compute is started, it finds the instance still in
the deleting task state, so it re-tries the delete. Part of the
cleanup call run by driver.destroy is to remove the serial console.
Note: this already ran, and the console was successfully removed, on
the first delete when the domain was successfully undefined. But since the
domain is no longer defined in libvirt, the
_get_serial_ports_from_instance call fails, and again the entire
delete operation fails and stops. This makes it impossible to fully
delete the instance.
When the serial console feature is disabled, this delete re-try
operation functions correctly and properly cleans up the rest of the
instance, and it transitions to deleted.
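The sequence described above can be modelled with a toy simulation. Everything in it (FakeCompute, PortDeleteFailed, the attribute names) is hypothetical and greatly simplified; it only mirrors the order of operations in this report, not nova's real control flow.

```python
class InstanceNotFound(Exception):
    """Hypothetical stand-in for nova's InstanceNotFound exception."""


class PortDeleteFailed(Exception):
    """Hypothetical stand-in for a failed or timed-out neutron port delete."""


class FakeCompute(object):
    def __init__(self, serial_console_enabled, neutron_up):
        self.serial_console_enabled = serial_console_enabled
        self.neutron_up = neutron_up
        self.domains = {'instance-00000444'}
        self.task_state = None

    def _destroy(self, name):
        # With serial console enabled, cleanup looks the domain up first,
        # which fails if an earlier attempt already undefined it.
        if self.serial_console_enabled and name not in self.domains:
            raise InstanceNotFound(name)
        self.domains.discard(name)  # undefine the domain

    def delete_instance(self, name):
        self.task_state = 'deleting'
        self._destroy(name)
        # The network is only deleted after the domain is gone.
        if not self.neutron_up:
            raise PortDeleteFailed(name)
        self.task_state = 'deleted'
```

In this model, a first delete with neutron unavailable leaves the instance in 'deleting' with the domain already undefined; the retry then raises InstanceNotFound when the serial console is enabled, and completes when it is disabled.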
FWIW, we are also running nova-cells, so the neutron --> nova port
notifications do not work/are disabled. Don't know if that's relevant
or not.
Steps to reproduce:
- nova-compute configured with serial console feature enabled
- Create an instance which has a serial console configured
- Delete that instance, but cause the neutron port delete to fail or time out (via iptables or just shutting off neutron temporarily)
- The instance should now be stuck in the deleting task state
- Restart nova-compute
- During the re-try of the delete operation, the above stack trace results.
Expected result:
Retries of instance deletions in this scenario should succeed with the
same behavior that happens when the serial console feature is
disabled.
Proposed Fix:
Under:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L761-L765
Shortly above this, create a variable called isdefined and set it to
true. Where we check whether the domain is defined, set isdefined to
false if it is not.
Under:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L848-L851
add a test to see if isdefined is false and if it is, do not attempt
to get the serial console for the nonexistent domain.
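As a rough sketch of this proposal: only the isdefined flag comes from the description above. The functions, the dictionary of domains, and the 'serial_ports' key are hypothetical stand-ins, and this is not the code at the linked lines.

```python
class InstanceNotFound(Exception):
    """Hypothetical stand-in for nova's InstanceNotFound exception."""


def lookup_domain(domains, name):
    # Hypothetical stand-in for a libvirt domain lookup.
    if name not in domains:
        raise InstanceNotFound(name)
    return domains[name]


def cleanup(domains, name, released):
    # Assume the domain is defined until the check says otherwise.
    isdefined = True
    try:
        lookup_domain(domains, name)
    except InstanceNotFound:
        isdefined = False

    # ... other cleanup steps would run here regardless of the flag ...

    if isdefined:
        # Only touch the serial console when the domain still exists.
        for host, port in lookup_domain(domains, name)['serial_ports']:
            released.append((host, port))
        del domains[name]  # undefine the domain
    return isdefined
```

A second call for the same name returns False and skips the serial console step instead of raising, so the retry can continue with the remaining cleanup.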
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1450594/+subscriptions