
yahoo-eng-team team mailing list archive

[Bug 1786703] Re: Placement duplicate aggregate uuid handling during concurrent aggregate create insufficiently robust


Reviewed:  https://review.openstack.org/592654
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2d7ed309ec4e656ce9d6f21f03ea158278f2526d
Submitter: Zuul
Branch:    master

commit 2d7ed309ec4e656ce9d6f21f03ea158278f2526d
Author: Jay Pipes <jaypipes@xxxxxxxxx>
Date:   Thu Aug 16 14:56:47 2018 -0400

    placement: use single-shot INSERT/DELETE agg
    
    When replacing a provider's set of aggregate associations, we were
    issuing a call to:
    
     DELETE resource_provider_aggregates WHERE resource_provider_id = $rp
    
    and then a single call to:
    
     INSERT INTO resource_provider_aggregates
     SELECT $rp, aggs.id
     FROM provider_aggregates AS aggs
     WHERE aggs.uuid IN ($agg_uuids)
    
    This patch changes the _set_aggregates() function in a few ways.
    First, we grab the aggregate's internal ID value when creating new
    aggregate records (or grabbing a provider's existing aggregate
    associations). This eliminates the need for any join to
    provider_aggregates in an INSERT/DELETE statement.
    
    Second, instead of a multi-row INSERT .. SELECT statement, we do
    single-shot INSERT ... VALUES statements, one for each added aggregate.
    
    Third, we no longer DELETE all aggregate associations for the provider
    in question. Instead, we issue single-shot DELETE statements for only
    the aggregates that are being disassociated.
    
    Finally, I've added a number of log debug statements so that we can have
    a little more information if this particular patch does not fix the
    deadlock issue described in the associated bug.
    
    Change-Id: I87e765305017eae1424005f7d6f419f42a2f8370
    Closes-bug: #1786703

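The replacement strategy described in the commit message can be sketched roughly as follows. This is a minimal illustration using Python's sqlite3, not the actual nova code: the table and column layout follow the commit message, but the `set_aggregates` helper and its exact schema are hypothetical.

```python
import sqlite3

def set_aggregates(conn, rp_id, desired_agg_ids):
    """Replace a provider's aggregate associations using single-shot
    INSERT/DELETE statements, one per added or removed aggregate."""
    cur = conn.execute(
        "SELECT aggregate_id FROM resource_provider_aggregates"
        " WHERE resource_provider_id = ?", (rp_id,))
    existing = {row[0] for row in cur}
    desired = set(desired_agg_ids)

    # Single-shot INSERT ... VALUES for each newly associated aggregate,
    # instead of one multi-row INSERT ... SELECT joining the aggregates table.
    for agg_id in desired - existing:
        conn.execute(
            "INSERT INTO resource_provider_aggregates"
            " (resource_provider_id, aggregate_id) VALUES (?, ?)",
            (rp_id, agg_id))

    # Single-shot DELETE for only the aggregates being disassociated,
    # instead of deleting every association for the provider up front.
    for agg_id in existing - desired:
        conn.execute(
            "DELETE FROM resource_provider_aggregates"
            " WHERE resource_provider_id = ? AND aggregate_id = ?",
            (rp_id, agg_id))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource_provider_aggregates ("
             "resource_provider_id INTEGER, aggregate_id INTEGER)")
set_aggregates(conn, 1, [10, 20])
set_aggregates(conn, 1, [20, 30])   # adds 30, drops 10, leaves 20 untouched
rows = sorted(r[0] for r in conn.execute(
    "SELECT aggregate_id FROM resource_provider_aggregates"
    " WHERE resource_provider_id = 1"))
print(rows)  # [20, 30]
```

The point of the narrower statements is that each one touches only the rows it must, which shrinks the lock footprint relative to a blanket DELETE followed by an INSERT ... SELECT.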

** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1786703

Title:
  Placement duplicate aggregate uuid handling during concurrent
  aggregate create insufficiently robust

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  NOTE: This may be just a postgresql problem, not sure.

  When doing some further experiments with load testing placement, my
  resource provider create script, which uses asyncio, was able to
  trigger several 500 errors from the placement service of the
  following form:

  ```
  cdent-a01:~/src/placeload(master) $ docker logs zen_murdock |grep 'req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2'
  2018-08-12 16:03:30.698 9 DEBUG nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Starting request: 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/placement/requestlog.py:38
  2018-08-12 16:03:30.903 9 ERROR nova.api.openstack.placement.fault_wrap [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Placement API unexpected error: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.IntegrityError) duplicate key value violates unique constraint "uniq_placement_aggregates0uuid"
  2018-08-12 16:03:30.914 9 INFO nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" status: 500 len: 997 microversion: 1.29
  ```

  The postgresql DETAIL for the failure was: "Key
  (uuid)=(14a5c8a3-5a99-4e8f-88be-00d85fcb1c17) already exists."

  This is because the code at https://github.com/openstack/nova/blob/a29ace1d48b5473b9e7b5decdf3d5d19f3d262f3/nova/api/openstack/placement/objects/resource_provider.py#L519-L529 is not trapping the right error when one request decides it needs to create a new aggregate while a concurrent request is already creating it.

  It's not clear to me whether this is because oslo_db is not
  transforming the postgresql error properly, or because the generic
  error caught there is the wrong one and we've never noticed before
  because we haven't hit the concurrency situation hard enough.
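  The race the bug describes is the classic create-or-get pattern: two
  requests both see that an aggregate UUID is missing and both try to
  insert it, and the loser's unique-constraint violation must be caught
  and turned into a re-read rather than surfacing as a 500. A hedged
  sketch of that pattern, using sqlite3 rather than oslo_db/SQLAlchemy
  (where the analogous exception would be something like
  DBDuplicateEntry); the `ensure_aggregate` helper and table layout are
  hypothetical, not the nova code:

  ```python
  import sqlite3

  def ensure_aggregate(conn, agg_uuid):
      """Create-or-get an aggregate row, tolerating a concurrent creator.

      If another request inserts the same UUID first, the INSERT hits the
      unique constraint; catch the duplicate-key error and read back the
      winner's row instead of failing the whole request.
      """
      try:
          cur = conn.execute(
              "INSERT INTO placement_aggregates (uuid) VALUES (?)",
              (agg_uuid,))
          return cur.lastrowid
      except sqlite3.IntegrityError:
          # Lost the race: the row now exists, so fetch its internal id.
          row = conn.execute(
              "SELECT id FROM placement_aggregates WHERE uuid = ?",
              (agg_uuid,)).fetchone()
          return row[0]

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE placement_aggregates ("
               "id INTEGER PRIMARY KEY, uuid TEXT UNIQUE)")
  first = ensure_aggregate(conn, "14a5c8a3-5a99-4e8f-88be-00d85fcb1c17")
  second = ensure_aggregate(conn, "14a5c8a3-5a99-4e8f-88be-00d85fcb1c17")
  print(first == second)  # True: both callers end up with the same row
  ```

  The subtlety flagged in the bug is that this only works if the except
  clause names the error the driver actually raises; trapping a
  different exception class lets the duplicate-key error escape.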

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1786703/+subscriptions
