yahoo-eng-team team mailing list archive

[Bug 2054329] [NEW] orphan allocations cause orphan resource providers and prevents compute service deletion

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Robert Franzke <2054329@xxxxxxxxxxxxxxxxxx>
Date: Mon, 19 Feb 2024 16:23:12 -0000
Reply-to: Bug 2054329 <2054329@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Description
===========
Orphan allocations can exist against a resource provider, e.g. when something went wrong during a migration.

During the deletion of a nova-compute-service, the nova-api also tries to delete the resource-provider in placement.
When the resource provider still has allocations against it, the deletion of the resource-provider fails, but the deletion of the nova-compute-service succeeds.
This leaves an orphan resource-provider behind.

This is due to the try/except around the deletion of the resource-provider:
https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321
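
The failure mode can be illustrated with a minimal, self-contained Python sketch. It is not Nova's actual code: `FakePlacement`, `ProviderInUse` and `delete_compute_service` are hypothetical names, and the in-memory dictionaries stand in for the Nova database and the placement service. It only models the pattern of catching and logging the placement error while the service deletion proceeds.

```python
import logging

LOG = logging.getLogger(__name__)


class ProviderInUse(Exception):
    """Hypothetical stand-in for the conflict reported by placement."""


class FakePlacement:
    """In-memory model of placement; not the real client."""

    def __init__(self):
        self.providers = {}    # provider UUID -> provider name
        self.allocations = {}  # provider UUID -> set of consumer UUIDs

    def delete_resource_provider(self, rp_uuid):
        # A provider that still has allocations cannot be deleted.
        if self.allocations.get(rp_uuid):
            raise ProviderInUse(rp_uuid)
        self.providers.pop(rp_uuid, None)


def delete_compute_service(services, placement, host, rp_uuid):
    """Delete the service record; only log a failed provider deletion."""
    try:
        placement.delete_resource_provider(rp_uuid)
    except ProviderInUse:
        # The error is swallowed, so the caller never learns that the
        # provider was left behind.
        LOG.error("Failed to delete resource provider %s", rp_uuid)
    services.pop(host, None)


placement = FakePlacement()
placement.providers["rp-1"] = "compute-01"
placement.allocations["rp-1"] = {"orphan-consumer"}
services = {"compute-01": "rp-1"}

delete_compute_service(services, placement, "compute-01", "rp-1")
print(services)             # {}
print(placement.providers)  # {'rp-1': 'compute-01'}
```

The service record is gone while the provider survives, which is the orphan state this report describes.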

If a new nova-compute-service with the same hostname is created, no new resource provider is created, because one with that hostname already exists.
This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider.

If you now try to delete the new nova-compute-service, it will raise a 'ValueError' due to this mismatch.
This also happens for all other requests to placement where the resource_provider is referenced by UUID instead of by name.
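
To make the mismatch concrete, here is a small Python sketch under stated assumptions. `ensure_provider`, `get_traits` and `ProviderNotFound` are hypothetical names, and the dictionary stands in for placement. The sketch models the behaviour as described in this report (an existing provider with the same name is reused), not the real placement API.

```python
import uuid


class ProviderNotFound(Exception):
    """Hypothetical error for a lookup by an unknown provider UUID."""


def ensure_provider(providers, name, new_uuid):
    """Return the UUID of the provider registered under ``name``.

    As described in the report, an existing provider with the same
    name is reused instead of a new one being created.
    """
    for rp_uuid, rp_name in providers.items():
        if rp_name == name:
            return rp_uuid
    providers[new_uuid] = name
    return new_uuid


def get_traits(providers, rp_uuid):
    """Look a provider up by UUID and return its (empty) trait set."""
    if rp_uuid not in providers:
        raise ProviderNotFound(
            "Failed to get traits for resource provider "
            f"with UUID {rp_uuid}")
    return set()


old_uuid = str(uuid.uuid4())
providers = {old_uuid: "compute-01"}  # orphan left by the deleted service

new_uuid = str(uuid.uuid4())          # ID of the recreated service
registered = ensure_provider(providers, "compute-01", new_uuid)

print(registered == old_uuid)  # True: the orphan was reused
try:
    get_traits(providers, new_uuid)
except ProviderNotFound as exc:
    print(exc)
```

Any request keyed on the new UUID fails, while a lookup by name would still find the old provider.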

Steps to reproduce
==================
1. Generate orphaned allocations on a resource provider
This can be done by creating an allocation for a random consumer UUID:
```
openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id>
```
2. Delete the nova-compute-service via the nova-api
3. Restart the nova-compute service, so a new nova-compute-service is created
4. You will start to see errors in the logs of placement/nova-api about the resource provider with the old UUID not being found
5. Delete the nova-compute-service via the nova-api. This generates a 500 error, and the nova-compute-service is not deleted.
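
Before deleting a service, it can help to check whether its resource provider still holds allocations whose consumers are unknown to Nova. The sketch below is a hedged illustration: `find_orphan_consumers` is a hypothetical helper, the payload shape is assumed to match what placement returns for a provider's allocations (a mapping of consumer UUID to resources under an `allocations` key), and the set of known consumers (instances and in-progress migrations) is assumed to be gathered separately.

```python
def find_orphan_consumers(allocations_payload, known_consumers):
    """Return consumer UUIDs that hold allocations but are not known."""
    consumers = allocations_payload.get("allocations", {})
    return sorted(c for c in consumers if c not in known_consumers)


# Example payload for one resource provider (assumed shape).
payload = {
    "allocations": {
        "aaaaaaaa-0000-4000-8000-000000000001": {
            "resources": {"VCPU": 2},
        },
        "bbbbbbbb-0000-4000-8000-000000000002": {
            "resources": {"VCPU": 4, "MEMORY_MB": 8192},
        },
    },
    "resource_provider_generation": 7,
}

# Consumers Nova knows about, e.g. existing instance UUIDs.
known = {"bbbbbbbb-0000-4000-8000-000000000002"}

print(find_orphan_consumers(payload, known))
# ['aaaaaaaa-0000-4000-8000-000000000001']
```

A non-empty result means the provider deletion in step 2 would fail for this provider.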

Expected result
===============
No errors in the logs about a resource-provider not being found by its ID.
The deletion of the recreated nova-compute-service should be successful.

Actual result
=============
We see errors in the log about the resource provider not being found:
```
An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e
```
We are not able to delete the newly created nova-compute-service due to a ValueError, because the resource-provider cannot be found based on the nova-compute-service UUID.

Environment
===========
We are running OpenStack Zed, but based on the code the issue should still be present on the master branch.

** Affects: nova
     Importance: Undecided
         Status: New

** Description changed:

- This causes a mismatch between the ID of the nova-compute-service and the resource provider.
+ This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider.
  

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054329

Title:
  orphan allocations cause orphan resource providers and prevents
  compute service deletion

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054329/+subscriptions