[Bug 1161193] Re: Compute manager fails to cleanup compute_nodes not reported by driver

 

** Changed in: nova/grizzly
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1161193

Title:
  Compute manager fails to cleanup compute_nodes not reported by driver

Status in OpenStack Compute (Nova):
  Fix Committed
Status in OpenStack Compute (nova) grizzly series:
  Fix Released

Bug description:
  When the virt driver supports multiple nodes and one node is removed
  from driver support, the compute_nodes records in the DB are not
  synced with the driver's node list. This causes the scheduler to pick
  a bad host, resulting in this error:

  fault: NovaException (code 500, created 2013-03-06T16:47:52Z)
  helium51 is not a valid node managed by this compute host.
    File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 223, in decorated_function
      return function(self, context, *args, **kwargs)
    File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1149, in run_instance
      do_run_instance()
    File "/usr/lib/python2.6/site-packages/nova/openstack/common/lockutils.py", line 242, in inner
      retval = f(*args, **kwargs)
    File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1148, in do_run_instance
      admin_password, is_first_time, node, instance)
    File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 802, in _run_instance
      self._set_instance_error_state(context, instance['uuid'])
    File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
      self.gen.next()
    File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 756, in _run_instance
      rt = self._get_resource_tracker(node)
    File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 353, in _get_resource_tracker
      raise exception.NovaException(msg)

  Two things I see in the code:

  First, the list of known nodes does not reflect the DB list but a
  list derived from driver.get_available_nodes:

  known_nodes = set(self._resource_tracker_dict.keys())

  This set will therefore never yield orphan compute_nodes in the
  statement:

  for nodename in known_nodes - nodenames:
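
  To see why, here is a small standalone illustration (node names are
  made up): the tracker dict is keyed only by driver-reported nodes, so
  the set difference is empty by construction:

  # illustrative sketch, not Nova source
  driver_nodes = ["helium50"]            # driver no longer reports helium51
  db_nodes = ["helium50", "helium51"]    # stale record remains in the DB

  # the compute manager builds trackers keyed by driver nodes only
  resource_tracker_dict = dict((name, object()) for name in driver_nodes)

  nodenames = set(driver_nodes)
  known_nodes = set(resource_tracker_dict)  # subset of nodenames by construction
  print(known_nodes - nodenames)            # set() -- the orphan is never found
  print(set(db_nodes) - nodenames)          # set(['helium51']) -- the real orphan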

  Second, even if we fix this so that known_nodes is fetched from the
  DB through the conductor, this code will always raise an exception:

  for nodename in known_nodes - nodenames:
      rt = self._get_resource_tracker(nodename)
      rt.update_available_resource(context, delete=True)

  because _get_resource_tracker always checks that the nodename is in
  driver.get_available_nodes and raises NovaException when it is not.
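
  Paraphrasing the relevant check (simplified; the exact source may
  differ slightly):

  def _get_resource_tracker(self, nodename):
      # simplified paraphrase of the Grizzly-era manager code
      rt = self._resource_tracker_dict.get(nodename)
      if not rt:
          if nodename not in self.driver.get_available_nodes():
              # an orphaned DB node is, by definition, absent from the
              # driver list, so deletion through this path cannot succeed
              raise exception.NovaException(
                  "%s is not a valid node managed by this "
                  "compute host." % nodename)
          # otherwise a new ResourceTracker is created and cached here
      return rt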

  To replicate this, you can simply change your hypervisor_hostname,
  which creates a new record in the nova.compute_nodes table while
  leaving the old record around. This simulates a compute node that is
  no longer supported in a multi-node scenario.

  Suggestion:

  Remove the logic that deletes orphan compute_nodes from
  compute.manager and move it to compute.resource_tracker under the
  _sync_compute_node method, which already loops through all
  compute_nodes records for the compute service; see the sketch below.
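
  A minimal sketch of that shape (illustrative only: the
  _get_service_compute_nodes helper and the compute_node_delete
  conductor call are hypothetical, and the committed fix may differ):

  def _sync_compute_node(self, context, resources):
      # prune DB records the driver no longer reports, while the records
      # for this compute service are already being iterated anyway
      for cn in self._get_service_compute_nodes(context):  # hypothetical
          if cn['hypervisor_hostname'] not in self.driver.get_available_nodes():
              # orphan: present in the DB but not reported by the driver
              self.conductor_api.compute_node_delete(context, cn)  # hypothetical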

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1161193/+subscriptions