yahoo-eng-team team mailing list archive
Message #93535
[Bug 2054329] [NEW] orphan allocations cause orphan resource providers and prevents compute service deletion
Public bug reported:
Description
===========
Orphan allocations against a resource provider can occur, e.g. when something went wrong during a migration.
During the deletion of a nova-compute-service, the nova-api tries to delete the resource-provider in placement as well.
When the resource provider still has allocations against it, the deletion of the resource-provider will fail but the deletion of the nova-compute-service will be successful.
This causes orphan resource-providers.
This is caused by the try/except around the deletion of the resource-provider:
https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321
If a new nova-compute-service with the same hostname gets created, it will not create a new resource provider as there is already one with the correct hostname.
This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider.
If you now try to delete the new nova-compute-service, it will generate a 'ValueError' due to this mismatch.
This also happens for all other requests to placement where the resource provider is referenced via the UUID instead of the name.
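The sequence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `FakePlacement`, `start_compute` and `delete_service` are hypothetical names invented for this illustration and are not nova or placement APIs. The new service getting a fresh UUID follows the report, while the exact point at which the `ValueError` is raised is an assumption chosen to mirror the reported symptom.

```python
import uuid


class FakePlacement:
    """In-memory stand-in for placement (hypothetical, not the real API)."""

    def __init__(self):
        self.providers = {}    # resource provider UUID -> hostname
        self.allocations = {}  # resource provider UUID -> set of consumer UUIDs

    def find_by_name(self, hostname):
        for rp_uuid, name in self.providers.items():
            if name == hostname:
                return rp_uuid
        return None

    def delete_provider(self, rp_uuid):
        # Deletion is refused while allocations still reference the provider.
        if self.allocations.get(rp_uuid):
            raise RuntimeError("resource provider still has allocations")
        self.providers.pop(rp_uuid, None)


def start_compute(placement, hostname):
    """Create a compute service; reuse an existing provider with the same name."""
    node_uuid = str(uuid.uuid4())
    if placement.find_by_name(hostname) is None:
        placement.providers[node_uuid] = hostname
    return node_uuid


def delete_service(placement, node_uuid):
    """Delete a compute service and try to delete its provider by UUID."""
    if node_uuid not in placement.providers:
        # Assumption: models the ValueError from the report.
        raise ValueError("no resource provider with UUID %s" % node_uuid)
    try:
        placement.delete_provider(node_uuid)
    except RuntimeError:
        # The failure is swallowed, so the service deletion still succeeds.
        pass
    return True


placement = FakePlacement()
old_uuid = start_compute(placement, "host1")
placement.allocations[old_uuid] = {"orphan-consumer"}  # orphan allocation
delete_service(placement, old_uuid)            # succeeds, provider stays behind
new_uuid = start_compute(placement, "host1")   # reuses the orphan provider
```

In this model, calling `delete_service(placement, new_uuid)` afterwards raises the `ValueError`, because the only provider named `host1` is still registered under `old_uuid`.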
Steps to reproduce
==================
1. Generate orphaned allocations on a resource provider
Can be done by generating a random allocation:
```
openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id>
```
2. Delete the nova-compute-service via the nova-api
3. Restart the nova-compute service, so a new nova-compute-service is created
4. You will start to see errors in the logs of placement/nova-api regarding not finding the resource provider with the old UUID
5. Delete the nova-compute-service via the nova-api. This will generate a 500 error and the nova-compute-service is not deleted.
Expected result
===============
No errors in the logs regarding not finding a resource-provider based on its ID.
The deletion of the recreated nova-compute-service should be successful.
Actual result
=============
We see errors in the log regarding not finding the resource provider:
```
An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e
```
We are not able to delete the newly created nova-compute-service due to a ValueError, as it is not able to find the resource-provider based on the nova-compute-service UUID.
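A mismatch of this kind could be detected by comparing the UUIDs known to nova with the resource providers known to placement. The following is a minimal sketch; `find_mismatched_hosts` and the shape of its inputs (plain dicts with `hostname`/`uuid` and `name`/`uuid` keys) are assumptions made for illustration, and how the two lists are obtained from a deployment is left open.

```python
def find_mismatched_hosts(compute_nodes, resource_providers):
    """Return (hostname, nova_uuid, rp_uuid) for hosts whose UUIDs differ.

    compute_nodes: list of dicts with 'hostname' and 'uuid' (hypothetical shape)
    resource_providers: list of dicts with 'name' and 'uuid' (hypothetical shape)
    """
    rp_uuid_by_name = {rp["name"]: rp["uuid"] for rp in resource_providers}
    mismatches = []
    for node in compute_nodes:
        rp_uuid = rp_uuid_by_name.get(node["hostname"])
        # A provider with the same name but a different UUID is the
        # orphan left behind by the earlier service deletion.
        if rp_uuid is not None and rp_uuid != node["uuid"]:
            mismatches.append((node["hostname"], node["uuid"], rp_uuid))
    return mismatches
```

Hosts without any resource provider of the same name are not reported, since that is a different situation from the one described here.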
Environment
===========
We are running OpenStack Zed, but based on the code the issue should still be present on the master branch.
** Affects: nova
Importance: Undecided
Status: New
** Description changed:
- This causes a mismatch between the ID of the nova-compute-service and the resource provider.
+ This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider.
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054329
Title:
orphan allocations cause orphan resource providers and prevents
compute service deletion
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054329/+subscriptions