
yahoo-eng-team team mailing list archive

[Bug 1786703] [NEW] Placement duplicate aggregate uuid handling during concurrent aggregate create insufficiently robust


Public bug reported:

NOTE: This may be a PostgreSQL-only problem; not sure.

While doing some further load-testing experiments against placement, my
resource provider create script, which uses asyncio, was able to trigger
several 500 errors from the placement service of the following form:

```
cdent-a01:~/src/placeload(master) $ docker logs zen_murdock |grep 'req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2'
2018-08-12 16:03:30.698 9 DEBUG nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Starting request: 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/placement/requestlog.py:38
2018-08-12 16:03:30.903 9 ERROR nova.api.openstack.placement.fault_wrap [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Placement API unexpected error: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.IntegrityError) duplicate key value violates unique constraint "uniq_placement_aggregates0uuid"
2018-08-12 16:03:30.914 9 INFO nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" status: 500 len: 997 microversion: 1.29
```

"DETAIL:  Key (uuid)=(14a5c8a3-5a99-4e8f-88be-00d85fcb1c17) already
exists."


This is because the code at https://github.com/openstack/nova/blob/a29ace1d48b5473b9e7b5decdf3d5d19f3d262f3/nova/api/openstack/placement/objects/resource_provider.py#L519-L529 is not trapping the right error when the server decides it needs to create a new aggregate that a concurrent request is already in the middle of creating.

It's not clear to me whether this is because oslo_db is not transforming
the PostgreSQL error properly, or whether the generic error trapped there
is the wrong one and we've never noticed before because we don't normally
hit the concurrency situation this hard.
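For illustration only, here is a minimal sketch of the get-or-create pattern that tolerates this race, using plain Python with stand-in names (`DuplicateEntry` and the `_store` dict are hypothetical, not nova's actual API; the real code would need oslo_db to surface the driver error as DBDuplicateEntry for the except clause to fire):

```python
class DuplicateEntry(Exception):
    """Stand-in for the driver's unique-constraint violation."""


_store = {}  # uuid -> internal id, simulating the placement_aggregates table


def _insert(uuid):
    """Simulated INSERT that enforces the unique constraint on uuid."""
    if uuid in _store:
        raise DuplicateEntry(uuid)
    _store[uuid] = len(_store) + 1
    return _store[uuid]


def ensure_aggregate(uuid):
    """Return the id for uuid, creating the row if it doesn't exist.

    A concurrent request may insert the same uuid between our lookup
    and our insert. Trapping the duplicate-key error and re-reading
    makes the operation idempotent instead of surfacing a 500.
    """
    agg_id = _store.get(uuid)
    if agg_id is not None:
        return agg_id
    try:
        return _insert(uuid)
    except DuplicateEntry:
        # Lost the race: another request created the row first. Re-read.
        return _store[uuid]
```

The key point is that the except clause must match the exception the database layer actually raises for the duplicate-key case; catching a more generic error (or the wrong translated one) is exactly what lets the 500 leak out.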

** Affects: nova
     Importance: Medium
         Status: New


** Tags: db placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1786703

Title:
  Placement duplicate aggregate uuid handling during concurrent
  aggregate create insufficiently robust

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1786703/+subscriptions

