[Bug 1161193] Re: Compute manager fails to cleanup compute_nodes not reported by driver
** Changed in: nova
Status: Fix Committed => Fix Released
** Changed in: nova
Milestone: None => havana-1
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1161193
Title:
Compute manager fails to cleanup compute_nodes not reported by driver
Status in OpenStack Compute (Nova):
Fix Released
Status in OpenStack Compute (nova) grizzly series:
Fix Released
Bug description:
When the virt driver supports multiple nodes and one node is removed
from driver support, the compute_nodes records in the DB are not
synced with the driver's node list. This causes the scheduler to pick
a bad host, resulting in this error:
  fault: {u'message': u'NovaException', u'code': 500, u'details': u'helium51 is not a valid node managed by this compute host.
      File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 223, in decorated_function
        return function(self, context, *args, **kwargs)
      File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1149, in run_instance
        do_run_instance()
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/lockutils.py", line 242, in inner
        retval = f(*args, **kwargs)
      File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1148, in do_run_instance
        admin_password, is_first_time, node, instance)
      File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 802, in _run_instance
        self._set_instance_error_state(context, instance[\'uuid\'])
      File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
        self.gen.next()
      File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 756, in _run_instance
        rt = self._get_resource_tracker(node)
      File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 353, in _get_resource_tracker
        raise exception.NovaException(msg)
    ', u'created': u'2013-03-06T16:47:52Z'}
Two things I see in the code:
First, the list of known nodes does not reflect the DB list but
rather a list built from driver.get_available_nodes():
    known_nodes = set(self._resource_tracker_dict.keys())
which will therefore never yield orphan compute_nodes in this
statement:
    for nodename in known_nodes - nodenames:
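A minimal runnable sketch of why that set difference is always empty
(all names here are illustrative stand-ins, not the real nova code):
    # Both sets derive from the same driver call, so the difference
    # can never contain a node the driver stopped reporting.
    class FakeDriver(object):
        def get_available_nodes(self):
            return ['node-a']   # 'node-b' was removed from the driver

    driver = FakeDriver()
    resource_tracker_dict = {}

    # trackers are created only for nodes the driver currently reports
    for name in driver.get_available_nodes():
        resource_tracker_dict[name] = object()  # stand-in for a tracker

    nodenames = set(driver.get_available_nodes())
    known_nodes = set(resource_tracker_dict.keys())

    # 'node-b' still has a compute_nodes row in the DB, but it appears
    # in neither set, so the orphan-cleanup loop never sees it.
    print(known_nodes - nodenames)  # set() -- always empty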
Secondly, even if we fix this by getting known_nodes from the DB
through the conductor, this code will always raise an exception:
    for nodename in known_nodes - nodenames:
        rt = self._get_resource_tracker(nodename)
        rt.update_available_resource(context, delete=True)
because _get_resource_tracker always checks that the nodename is in
driver.get_available_nodes()
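To make that failure mode concrete, here is a minimal self-contained
model of the guard (a simplified stand-in for the real
ComputeManager._get_resource_tracker, not a copy of it):
    # Simplified model of the _get_resource_tracker guard.
    class FakeDriver(object):
        def get_available_nodes(self):
            return ['node-a']   # the orphan 'node-b' is gone

    class FakeManager(object):
        def __init__(self, driver):
            self.driver = driver
            self._resource_tracker_dict = {}

        def _get_resource_tracker(self, nodename):
            rt = self._resource_tracker_dict.get(nodename)
            if not rt:
                if nodename not in self.driver.get_available_nodes():
                    raise Exception('%s is not a valid node managed by '
                                    'this compute host.' % nodename)
                rt = object()   # stand-in for ResourceTracker
                self._resource_tracker_dict[nodename] = rt
            return rt

    manager = FakeManager(FakeDriver())
    # The orphan is exactly the node the driver no longer reports, so
    # the cleanup path can never obtain a tracker for it:
    manager._get_resource_tracker('node-b')   # raises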
To replicate this, you can simply change your hypervisor_hostname,
which will create a new record in the nova.compute_nodes table while
leaving the old record around. This simulates a compute node that is
no longer supported in a multi-node scenario.
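One quick way to see the stale row (a rough sketch against the
grizzly-era nova.db API from a configured nova environment; the exact
calls are assumptions, and plain SQL against the compute_nodes table
works just as well):
    # Rough sketch, assuming a configured nova environment and the
    # grizzly-era nova.db API; adjust for your deployment.
    from nova import context
    from nova import db

    ctxt = context.get_admin_context()
    for node in db.compute_node_get_all(ctxt):
        # After changing hypervisor_hostname and restarting
        # nova-compute, both the old and new hostnames show up here.
        print('%s %s' % (node['id'], node['hypervisor_hostname']))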
Suggestion:
Remove the logic that deletes orphan compute_nodes from
compute.manager and move it into compute.resource_tracker under the
_sync_compute_node method, which already loops through all
compute_nodes records for the compute service (a rough sketch of the
idea follows below).
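Standalone sketch of the suggested sync behavior (illustrative only,
with hypothetical names -- not the committed patch):
    # Drop DB records for nodes the driver no longer reports.
    def sync_compute_nodes(db_nodes, driver_nodes, delete_node):
        for nodename in set(db_nodes) - set(driver_nodes):
            delete_node(nodename)

    def delete_node(nodename):
        # stand-in for whatever actually removes the DB record
        print('deleting orphan compute_node %s' % nodename)

    # Example: 'helium51' was renamed, so the driver only reports the
    # new name while the old row lingers in nova.compute_nodes.
    db_nodes = ['helium51', 'helium51-new']
    driver_nodes = ['helium51-new']
    sync_compute_nodes(db_nodes, driver_nodes, delete_node)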
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1161193/+subscriptions