yahoo-eng-team team mailing list archive

[Bug 2054329] Re: orphan allocations cause orphan resource providers and prevents compute service deletion

 

This is a known issue that we recently fixed by ensuring that the
hostname can't be changed silently:
https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/stable-compute-uuid.html

That series won't be backported to Zed, so I'd recommend upgrading to
Antelope. In the meantime, you can clean up the orphaned resources with
the 'nova-manage placement audit' command, which will tell you which
placement resources are zombies.
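
For illustration, here is a minimal Python sketch of how the audit could be
wrapped so that it stays read-only by default. The helper names
(audit_command, run_audit) are hypothetical. The flags --verbose,
--resource_provider and --delete are taken from the nova-manage
documentation as I recall it, so please check them against
'nova-manage placement audit --help' on your release before relying on this.

```python
import shlex
import subprocess


def audit_command(resource_provider=None, verbose=True, delete=False):
    """Build the argv for 'nova-manage placement audit' (hypothetical helper)."""
    cmd = ["nova-manage", "placement", "audit"]
    if verbose:
        cmd.append("--verbose")
    if resource_provider:
        # Assumed flag: restrict the audit to a single resource provider UUID.
        cmd += ["--resource_provider", resource_provider]
    if delete:
        # Assumed flag: remove the orphaned allocations instead of only listing them.
        cmd.append("--delete")
    return cmd


def run_audit(dry_run=True, **kwargs):
    """Return the command line when dry_run is set, otherwise run it and return stdout."""
    cmd = audit_command(**kwargs)
    if dry_run:
        return shlex.join(cmd)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling run_audit() without arguments only returns the command line that
would be executed. Running the audit without the delete option first and
reviewing its report seems the safer order.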


** Changed in: nova
       Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054329

Title:
  orphan allocations cause orphan resource providers and prevents
  compute service deletion

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  Description
  ===========
  It can happen that there are orphan allocations against a resource provider,
  e.g. when something went wrong during a migration.

  During the deletion of a nova-compute-service, the nova-api tries to delete the resource-provider in placement as well.
  When the resource provider still has allocations against it, the deletion of the resource-provider will fail, but the deletion of the nova-compute-service will be successful.
  This causes orphan resource-providers.

  This is due to the try/except around the deletion of the resource-provider:
  https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321

  If a new nova-compute-service with the same hostname gets created, it will not create a new resource provider as there is already one with the correct hostname.
  This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider.

  If you now try to delete the new nova-compute-service, it will raise a 'ValueError' due to this mismatch.
  This also happens for all other requests to placement, where the resource_provider is referenced via the UUID instead of the name.
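
The sequence described above can be sketched with a small in-memory model.
This is a toy illustration, not nova's or placement's actual code: the
classes and functions (ToyPlacement, Conflict, delete_compute_service, demo)
are hypothetical, and the name-based reuse of a provider is modelled only as
the report describes it.

```python
import uuid


class Conflict(Exception):
    """Stands in for placement refusing to delete a provider with allocations."""


class ToyPlacement:
    def __init__(self):
        self.providers = {}    # provider UUID -> provider name (hostname)
        self.allocations = {}  # provider UUID -> set of consumer UUIDs

    def ensure_provider(self, name, wanted_uuid):
        # Reuse an existing provider with the same name, as the report describes.
        for rp_uuid, rp_name in self.providers.items():
            if rp_name == name:
                return rp_uuid
        self.providers[wanted_uuid] = name
        return wanted_uuid

    def delete_provider(self, rp_uuid):
        if rp_uuid not in self.providers:
            raise ValueError(f"no resource provider with UUID {rp_uuid}")
        if self.allocations.get(rp_uuid):
            raise Conflict(f"provider {rp_uuid} still has allocations")
        del self.providers[rp_uuid]


def delete_compute_service(services, placement, hostname):
    """Delete the service record while tolerating a failed provider deletion."""
    node_uuid = services[hostname]
    try:
        placement.delete_provider(node_uuid)
    except Conflict:
        pass  # swallowed: the provider is left behind as an orphan
    del services[hostname]


def demo():
    placement = ToyPlacement()
    services = {}  # hostname -> UUID the compute side expects its provider to have

    old_uuid = str(uuid.uuid4())
    services["compute-1"] = old_uuid
    placement.ensure_provider("compute-1", old_uuid)
    placement.allocations[old_uuid] = {str(uuid.uuid4())}  # orphan allocation

    # First deletion: the service goes away, the provider does not.
    delete_compute_service(services, placement, "compute-1")
    orphaned = old_uuid in placement.providers

    # The recreated service gets a new UUID but ends up with the old provider.
    new_uuid = str(uuid.uuid4())
    services["compute-1"] = new_uuid
    actual_uuid = placement.ensure_provider("compute-1", new_uuid)

    # Second deletion: the lookup by the new UUID fails and the service remains.
    try:
        delete_compute_service(services, placement, "compute-1")
        second_delete_failed = False
    except ValueError:
        second_delete_failed = True
    return orphaned, actual_uuid == old_uuid, second_delete_failed
```

In this model, demo() reports that the provider was orphaned, that the
recreated service was matched to the old provider, and that the second
deletion failed. Only the conflict is swallowed, which is why the first
deletion appears to succeed while the second one does not.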

  Steps to reproduce
  ==================
  1. Generate orphaned allocations on a resource provider
  Can be done by generating a random allocation:
  ```
  openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id>
  ```
  2. Delete the nova-compute-service via the nova-api
  3. Restart the nova-compute service, so a new nova-compute-service is created
  4. You will start to see errors in the logs of placement/nova-api about the resource provider with the old UUID not being found
  5. Delete the nova-compute-service via the nova-api. This will generate a 500 error, and the nova-compute-service is not deleted.

  Expected result
  ===============
  No errors in the logs about a resource-provider not being found by its ID.
  The deletion of the recreated nova-compute-service should be successful.

  Actual result
  =============
  We see errors in the log about the resource provider not being found:
  ```
  An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e
  ```
  We are not able to delete the newly created nova-compute-service because of a ValueError: the resource-provider cannot be found by the nova-compute-service UUID.
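
A mismatch of this kind could be detected ahead of time by comparing the
UUID the compute side uses with the UUID of the same-named resource
provider. The sketch below assumes the two listings have already been
fetched and converted to plain dicts. The function name and the dict keys
("hostname", "name", "uuid") are hypothetical and do not correspond to a
particular API response.

```python
def find_uuid_mismatches(compute_nodes, resource_providers):
    """Return (hostname, compute_uuid, provider_uuid) for every disagreement.

    compute_nodes: iterable of {"hostname": str, "uuid": str}
    resource_providers: iterable of {"name": str, "uuid": str}
    Both shapes are assumptions made for this sketch.
    """
    rp_by_name = {rp["name"]: rp["uuid"] for rp in resource_providers}
    mismatches = []
    for node in compute_nodes:
        rp_uuid = rp_by_name.get(node["hostname"])
        # Hosts without any provider are skipped; only conflicting UUIDs count.
        if rp_uuid is not None and rp_uuid != node["uuid"]:
            mismatches.append((node["hostname"], node["uuid"], rp_uuid))
    return mismatches


# Example input: the compute-side UUID is the one from the log above; the
# provider UUID and the hostname are made-up placeholders.
nodes = [{"hostname": "compute-1", "uuid": "d5d7cf1c-51ea-4139-9fc3-6007ba58441e"}]
providers = [{"name": "compute-1", "uuid": "00000000-0000-0000-0000-000000000001"}]
mismatched = find_uuid_mismatches(nodes, providers)
```

Any host that shows up in the result is one where lookups by UUID would be
expected to fail in the way shown in the log excerpt.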

  Environment
  ===========
  We are running OpenStack Zed, but based on the code the issue should still be present on the master branch.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054329/+subscriptions


