yahoo-eng-team team mailing list archive

[Bug 1721652] [NEW] Evacuate cleanup fails at _delete_allocation_for_moved_instance

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Charles Volzka <cjvolzka@xxxxxxxxxx>
Date: Thu, 05 Oct 2017 21:58:55 -0000
Reply-to: Bug 1721652 <1721652@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

Description
===========
After an evacuation, when nova-compute is restarted on the source host, cleanup of the evacuated instance on the source host fails. The traceback in nova-compute.log ends with:
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 679, in _destroy_evacuated_instances
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service     instance, migration.source_node)
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1216, in delete_allocation_for_evacuated_instance
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service     instance, node, 'evacuated', node_type)
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1227, in _delete_allocation_for_moved_instance
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service     cn_uuid = self.compute_nodes[node].uuid
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service KeyError: u'<SOURCE_HOST_NAME>'
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service


Steps to reproduce
==================
Deploy instance on Host A.
Shut down Host A.
Evacuate instance to Host B.
Power Host A back on so that nova-compute restarts.
Wait for the cleanup of the old instance's allocation to run.

Expected result
===============
Cleanup of the old instance on Host A succeeds.

Actual result
=============
Cleanup of the old instance appears to work, but there is a traceback in the log and the instance's allocation is not removed.

Environment
===========
(pike)nova-compute/now 10:16.0.0-201710030907


Additional Info:
================
The problem seems to come from this change: https://github.com/openstack/nova/commit/0de806684f5d670dd5f961f7adf212961da3ed87 at:
rt = self._get_resource_tracker()
rt.delete_allocation_for_evacuated_instance
This is called very early in the init_host flow to clean up the allocations. At that point in startup the resource tracker's self.compute_nodes has not been populated yet, so it has no entry for the source node. That makes delete_allocation_for_evacuated_instance fail with a KeyError at:
cn_uuid = self.compute_nodes[node].uuid
The resource tracker's compute node entries are only initialized later in the startup process, via update_available_resource() -> _update_available_resource() -> _init_compute_node(). They are not initialized when the tracker is first created, which appears to be the assumption made by the referenced commit.
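
The ordering problem described above can be modelled outside of nova with a small sketch. Everything here is a simplified stand-in: FakeResourceTracker, FakeComputeNode and init_host_as_reported are hypothetical names invented for illustration, and only the dictionary lookup mirrors the failing line in the traceback. It is not nova's code and not a proposed fix.

```python
class FakeComputeNode(object):
    """Hypothetical stand-in for a compute node record."""

    def __init__(self, uuid):
        self.uuid = uuid


class FakeResourceTracker(object):
    """Hypothetical stand-in for the resource tracker; not the real class."""

    def __init__(self):
        # Assumption: the node map starts out empty and is filled in later.
        self.compute_nodes = {}
        self.deleted = []

    def update_available_resource(self, nodename):
        # Stands in for the later startup path that ends in
        # _init_compute_node() and populates the node map.
        self.compute_nodes[nodename] = FakeComputeNode('uuid-for-' + nodename)

    def delete_allocation_for_evacuated_instance(self, instance_uuid, node):
        # Same lookup as the failing line in the traceback: raises KeyError
        # when the node has not been initialized yet.
        cn_uuid = self.compute_nodes[node].uuid
        self.deleted.append((instance_uuid, cn_uuid))


def init_host_as_reported(rt, node):
    # Order described in this report: evacuation cleanup runs first,
    # the compute node is only initialized afterwards.
    rt.delete_allocation_for_evacuated_instance('instance-1', node)
    rt.update_available_resource(node)
```

In this model, calling init_host_as_reported() on a fresh tracker raises KeyError before anything is recorded in deleted, while calling update_available_resource() first lets the same cleanup call succeed. That matches the symptom above: the lookup fails only because it runs before the node map is populated.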

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1721652

Title:
  Evacuate cleanup fails at _delete_allocation_for_moved_instance

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1721652/+subscriptions
Follow ups

[Bug 1721652] Re: Evacuate cleanup fails at _delete_allocation_for_moved_instance
From: OpenStack Infra, 2017-10-17
[Bug 1721652] Re: Evacuate cleanup fails at _delete_allocation_for_moved_instance
From: Matt Riedemann, 2017-10-06