yahoo-eng-team team mailing list archive
Message #74410
[Bug 1786703] Re: Placement duplicate aggregate uuid handling during concurrent aggregate create insufficiently robust
Reviewed: https://review.openstack.org/592654
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2d7ed309ec4e656ce9d6f21f03ea158278f2526d
Submitter: Zuul
Branch: master
commit 2d7ed309ec4e656ce9d6f21f03ea158278f2526d
Author: Jay Pipes <jaypipes@xxxxxxxxx>
Date: Thu Aug 16 14:56:47 2018 -0400
placement: use single-shot INSERT/DELETE agg
When replacing a provider's set of aggregate associations, we were
issuing a call to:
```
DELETE FROM resource_provider_aggregates WHERE resource_provider_id = $rp
```
and then a single call to:
```
INSERT INTO resource_provider_aggregates
SELECT $rp, aggs.id
FROM provider_aggregates AS aggs
WHERE aggs.uuid IN ($agg_uuids)
```
This patch changes the _set_aggregates() function in a few ways.
First, we grab the aggregate's internal ID value when creating new
aggregate records (or when looking up a provider's existing aggregate
associations). This eliminates the need for any join to
provider_aggregates in an INSERT/DELETE statement.
Second, instead of a multi-row INSERT ... SELECT statement, we issue
single-shot INSERT ... VALUES statements, one for each added aggregate.
Third, we no longer DELETE all aggregate associations for the provider
in question. Instead, we issue single-shot DELETE statements for only
the aggregates that are being disassociated.
Finally, I've added a number of log debug statements so that we can have
a little more information if this particular patch does not fix the
deadlock issue described in the associated bug.
Change-Id: I87e765305017eae1424005f7d6f419f42a2f8370
Closes-bug: #1786703
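To make the shape of the change concrete, here is a minimal sketch of the
single-shot approach in SQLAlchemy Core. The table definition, the function
name set_aggregates(), and the pre-resolved id sets passed in are assumptions
for illustration only; this is not the actual nova/placement code.
```
# Sketch of the before/after described in the commit message, using
# SQLAlchemy Core. Table and column names here are placeholders.
import sqlalchemy as sa

metadata = sa.MetaData()
_RP_AGG_TBL = sa.Table(
    'resource_provider_aggregates', metadata,
    sa.Column('resource_provider_id', sa.Integer, primary_key=True),
    sa.Column('aggregate_id', sa.Integer, primary_key=True))


def set_aggregates(conn, rp_id, existing_agg_ids, wanted_agg_ids):
    """Replace a provider's aggregate associations one row at a time."""
    to_add = set(wanted_agg_ids) - set(existing_agg_ids)
    to_remove = set(existing_agg_ids) - set(wanted_agg_ids)

    # Single-shot INSERT ... VALUES for each newly associated aggregate,
    # instead of one INSERT ... SELECT joined against the aggregates table.
    for agg_id in to_add:
        conn.execute(_RP_AGG_TBL.insert().values(
            resource_provider_id=rp_id, aggregate_id=agg_id))

    # Single-shot DELETE for only the aggregates being disassociated,
    # instead of deleting every association for the provider up front.
    for agg_id in to_remove:
        conn.execute(_RP_AGG_TBL.delete().where(sa.and_(
            _RP_AGG_TBL.c.resource_provider_id == rp_id,
            _RP_AGG_TBL.c.aggregate_id == agg_id)))
```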
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1786703
Title:
Placement duplicate aggregate uuid handling during concurrent
aggregate create insufficiently robust
Status in OpenStack Compute (nova):
Fix Released
Bug description:
NOTE: This may be just a postgresql problem, not sure.
When doing some further experiments with load testing placement, my
resource provider create script, which uses asyncio, was able to cause
several 500 errors from the placement service of the following form:
```
cdent-a01:~/src/placeload(master) $ docker logs zen_murdock |grep 'req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2'
2018-08-12 16:03:30.698 9 DEBUG nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Starting request: 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/placement/requestlog.py:38
2018-08-12 16:03:30.903 9 ERROR nova.api.openstack.placement.fault_wrap [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Placement API unexpected error: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.IntegrityError) duplicate key value violates unique constraint "uniq_placement_aggregates0uuid"
2018-08-12 16:03:30.914 9 INFO nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" status: 500 len: 997 microversion: 1.29
```
"DETAIL: Key (uuid)=(14a5c8a3-5a99-4e8f-88be-00d85fcb1c17) already
exists."
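A reproduction along those lines might look roughly like the sketch below.
This is not the actual placeload script: the placement endpoint, auth token,
provider generation of 0, and the provider uuid list are placeholder
assumptions for the example.
```
# Fire many concurrent PUT .../aggregates requests that all reference the
# same aggregate uuid, so several requests race to create that aggregate row.
import asyncio
import aiohttp

PLACEMENT = 'http://localhost:8080'
HEADERS = {
    'x-auth-token': 'admin',
    'openstack-api-version': 'placement 1.29',
    'content-type': 'application/json',
}
AGG_UUID = '14a5c8a3-5a99-4e8f-88be-00d85fcb1c17'


async def put_aggregates(session, rp_uuid):
    url = '%s/resource_providers/%s/aggregates' % (PLACEMENT, rp_uuid)
    body = {'aggregates': [AGG_UUID], 'resource_provider_generation': 0}
    async with session.put(url, json=body, headers=HEADERS) as resp:
        return resp.status


async def main(rp_uuids):
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *[put_aggregates(session, rp) for rp in rp_uuids])
        print(statuses)  # 500s here indicate the duplicate-key race

# asyncio.run(main(list_of_provider_uuids))
```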
This is because the code at https://github.com/openstack/nova/blob/a29ace1d48b5473b9e7b5decdf3d5d19f3d262f3/nova/api/openstack/placement/objects/resource_provider.py#L519-L529 is not trapping the right error when one request decides it needs to create a new aggregate while a concurrent request is already creating it.
It's not clear to me whether this is because oslo_db is not transforming
the postgresql error properly, or because the generic error trapped there
is the wrong one and we've never noticed before because we don't hit the
concurrency situation hard enough.
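For illustration only, the kind of handling being pointed at looks roughly
like the sketch below. The helper names _get_aggregate_id() and
_create_aggregate() are invented for this example and are not the real nova
helpers; DBDuplicateEntry is the oslo.db exception that a translated
unique-constraint violation is expected to raise.
```
# Tolerate a concurrent creator winning the race to insert the same
# aggregate uuid, instead of letting the error bubble up as a 500.
from oslo_db import exception as db_exc


def ensure_aggregate(ctx, agg_uuid):
    """Return the internal id for agg_uuid, creating the row if needed."""
    agg_id = _get_aggregate_id(ctx, agg_uuid)
    if agg_id is not None:
        return agg_id
    try:
        return _create_aggregate(ctx, agg_uuid)
    except db_exc.DBDuplicateEntry:
        # Another request inserted the same uuid between our SELECT and
        # INSERT; when oslo.db translates the unique-constraint violation
        # (e.g. the psycopg2 IntegrityError above) into DBDuplicateEntry,
        # we can simply re-read the winner's row instead of failing.
        return _get_aggregate_id(ctx, agg_uuid)
```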
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1786703/+subscriptions