yahoo-eng-team team mailing list archive

[Bug 1834712] [NEW] ResourceTracker._update should restore previous old_resources value if ComputeNode.save fails

 

Public bug reported:

This is a follow-up to bug 1834694, with the debug information here:

https://review.opendev.org/#/c/668252/1/nova/scheduler/host_manager.py@626

This is on an overloaded system where conductor and mysql are having
problems and database connections are getting dropped.

On the first start of the compute service, the compute node record is
created without the free_disk_gb field set.

Later, in the _update() method in ResourceTracker, the _resource_change
method returns True and updates the self.old_resources value:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L908

Then the ComputeNode.save() fails with a DB error here:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L1010

That kills the update_available_resource run but doesn't kill the
service, because the exception is caught here:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L8130

The next time update_available_resource runs, _resource_change does not
detect any changes here because old_resources was already updated before
the failed save:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L906

So we don't try to call ComputeNode.save() again but instead call
_update_to_placement here:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L1012

This can create the resource provider with inventory in the placement
service.
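
The sequence above can be reproduced in a small simulation. This is only
a sketch to illustrate the ordering problem, not nova code:
FakeComputeNode, BuggyTracker, run_periodic and the placement_updates
counter are hypothetical stand-ins, and the resource comparison is
reduced to a single free_disk_gb field.

```python
class FakeComputeNode:
    """Hypothetical stand-in for a ComputeNode object (not nova code)."""

    def __init__(self, fail_times=0):
        self.free_disk_gb = None
        self.saved = None  # what the simulated database holds
        self._fail_times = fail_times

    def save(self):
        # Fail the first `fail_times` calls to mimic dropped DB connections.
        if self._fail_times > 0:
            self._fail_times -= 1
            raise RuntimeError("simulated DB connection drop")
        self.saved = {"free_disk_gb": self.free_disk_gb}


class BuggyTracker:
    """Simplified tracker that caches resources before the save succeeds."""

    def __init__(self):
        self.old_resources = {}
        self.placement_updates = 0  # counts simulated placement updates

    def _resource_change(self, node_name, compute_node):
        current = {"free_disk_gb": compute_node.free_disk_gb}
        if self.old_resources.get(node_name) != current:
            # The cache is updated here, before save() has been attempted.
            self.old_resources[node_name] = current
            return True
        return False

    def _update(self, node_name, compute_node):
        if self._resource_change(node_name, compute_node):
            compute_node.save()
        # Stand-in for the call to _update_to_placement.
        self.placement_updates += 1


def run_periodic(tracker, node_name, compute_node):
    """Mimic a periodic task that swallows exceptions; True on success."""
    try:
        tracker._update(node_name, compute_node)
        return True
    except Exception:
        return False


node = FakeComputeNode(fail_times=1)
node.free_disk_gb = 100
tracker = BuggyTracker()

first_ok = run_periodic(tracker, "node1", node)   # save() fails
second_ok = run_periodic(tracker, "node1", node)  # no change detected

print(first_ok, second_ok, node.saved, tracker.placement_updates)
```

In this sketch the second run reports success and performs the simulated
placement update, yet the simulated database record is never written.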

As a result, the scheduler can get the compute node resource provider
back from placement even though the compute node record was never
updated in the database, which results in hitting this code in the
scheduler:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/scheduler/host_manager.py#L193

That leaves some of the HostState fields unset, which in turn results in
issues like bug 1834691 and bug 1834694.

We could deal with the RT issues in a few ways, like not allowing the
compute service to start if we can't create and update the compute node
(rather than just catching and swallowing Exception in the
ComputeManager), but that might have other side effects. An easy thing
to do here is to make sure we roll back the changes to old_resources in
the RT if compute_node.save() fails.
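
A minimal sketch of the rollback idea, under the same simplifying
assumptions as the description above. This is not the nova patch:
FakeComputeNode and RollbackTracker are hypothetical names, and the
resource comparison is reduced to one field. The point is only that the
cached value is restored before the exception is re-raised, so the next
run detects the change again and retries the save.

```python
import copy


class FakeComputeNode:
    """Hypothetical stand-in for a ComputeNode object (not nova code)."""

    def __init__(self, fail_times=0):
        self.free_disk_gb = None
        self.saved = None  # what the simulated database holds
        self._fail_times = fail_times

    def save(self):
        # Fail the first `fail_times` calls to mimic dropped DB connections.
        if self._fail_times > 0:
            self._fail_times -= 1
            raise RuntimeError("simulated DB connection drop")
        self.saved = {"free_disk_gb": self.free_disk_gb}


class RollbackTracker:
    """Simplified tracker that restores its cache when save() fails."""

    def __init__(self):
        self.old_resources = {}

    def _resource_change(self, node_name, compute_node):
        current = {"free_disk_gb": compute_node.free_disk_gb}
        if self.old_resources.get(node_name) != current:
            self.old_resources[node_name] = current
            return True
        return False

    def _update(self, node_name, compute_node):
        # Remember the cached state before _resource_change mutates it.
        had_entry = node_name in self.old_resources
        previous = copy.deepcopy(self.old_resources.get(node_name))
        if self._resource_change(node_name, compute_node):
            try:
                compute_node.save()
            except Exception:
                # Roll back so the next run sees the change again.
                if had_entry:
                    self.old_resources[node_name] = previous
                else:
                    self.old_resources.pop(node_name, None)
                raise


node = FakeComputeNode(fail_times=1)
node.free_disk_gb = 100
tracker = RollbackTracker()

first_error = None
try:
    tracker._update("node1", node)  # save() fails, cache is rolled back
except RuntimeError as exc:
    first_error = exc

cache_after_failure = dict(tracker.old_resources)
tracker._update("node1", node)      # change detected again, save() retried

print(cache_after_failure, node.saved)
```

Re-raising after the rollback keeps the existing behaviour where the
caller decides how to handle the failure; only the stale cache entry is
undone.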

** Affects: nova
     Importance: Medium
         Status: Triaged


** Tags: resource-tracker

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834712
