
yahoo-eng-team team mailing list archive

[Bug 1779931] [NEW] Provider update race between host aggregate sync and resource tracker

 

Public bug reported:

The resource tracker (in n-cpu) used to be the only place we were
pushing changes to placement, all funneled through a single mutex
(COMPUTE_RESOURCE_SEMAPHORE) to prevent conflicts.

When we started mirroring host aggregates as placement aggregates [1],
which happens in the n-api process, we introduced races with the
resource tracker, for example:

n-api: aggregate_add_host => _get_provider_by_name [2]
n-cpu: get_provider_tree_and_ensure_root [3]
n-api: set_aggregates_for_provider [4]
n-cpu: update_from_provider_tree [5] => set_aggregates_for_provider [6]

(similar for aggregate_remove_host)
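The lost-update shape of that interleaving can be sketched as follows. This is a toy simulation, not the actual nova or placement code; all names here (FakePlacement, agg-A, agg-B) are illustrative:

```python
class FakePlacement:
    """Stands in for the placement API's per-provider aggregate list."""

    def __init__(self):
        self.aggregates = set()

    def set_aggregates_for_provider(self, aggregates):
        # A plain PUT with no generation check: whoever writes last
        # wins, silently.
        self.aggregates = set(aggregates)


placement = FakePlacement()
placement.aggregates = {'agg-A'}

# Both processes read the provider's aggregates before either writes,
# as in the n-api/n-cpu interleaving above.
napi_view = set(placement.aggregates)   # n-api: _get_provider_by_name
ncpu_view = set(placement.aggregates)   # n-cpu: get_provider_tree_and_ensure_root

# n-api adds the host to a new aggregate and writes first.
napi_view.add('agg-B')
placement.set_aggregates_for_provider(napi_view)

# n-cpu writes its (now stale) view second, silently discarding agg-B.
placement.set_aggregates_for_provider(ncpu_view)

print(placement.aggregates)  # {'agg-A'} -- n-api's update was blown away
```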

Whichever process reaches set_aggregates_for_provider first pushes its
view of the aggregates to placement.  Until we start checking for
generation conflicts in set_aggregates_for_provider, whichever process
gets there second simply overwrites the first.  The race therefore
causes no visible failures, so it goes unnoticed.

Once we do start checking for generation conflicts in
set_aggregates_for_provider [7], we start seeing actual failures, like:

 tempest.api.compute.admin.test_aggregates.AggregatesAdminTestJSON.test_aggregate_add_host_get_details[id-eeef473c-7c52-494d-9f09-2ed7fc8fc036]
 ----------------------------------------------------------------------------------------------------------------------------------------------

 Captured traceback-1:
 ~~~~~~~~~~~~~~~~~~~~~
     Traceback (most recent call last):
       File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
         return func(*args, **kwargs)
       File "tempest/lib/services/compute/aggregates_client.py", line 70, in delete_aggregate
         resp, body = self.delete("os-aggregates/%s" % aggregate_id)
       File "tempest/lib/common/rest_client.py", line 310, in delete
         return self.request('DELETE', url, extra_headers, headers, body)
       File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
         method, url, extra_headers, headers, body, chunked)
       File "tempest/lib/common/rest_client.py", line 668, in request
         self._error_checker(resp, resp_body)
       File "tempest/lib/common/rest_client.py", line 779, in _error_checker
         raise exceptions.BadRequest(resp_body, resp=resp)
     tempest.lib.exceptions.BadRequest: Bad request
     Details: {u'code': 400, u'message': u'Cannot remove host from aggregate 2. Reason: Host aggregate is not empty.'}

...

 Captured traceback:
 ~~~~~~~~~~~~~~~~~~~
     Traceback (most recent call last):
       File "tempest/api/compute/admin/test_aggregates.py", line 193, in test_aggregate_add_host_get_details
         self.client.add_host(aggregate['id'], host=self.host)
       File "tempest/lib/services/compute/aggregates_client.py", line 95, in add_host
         post_body)
       File "tempest/lib/common/rest_client.py", line 279, in post
         return self.request('POST', url, extra_headers, headers, body, chunked)
       File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
         method, url, extra_headers, headers, body, chunked)
       File "tempest/lib/common/rest_client.py", line 668, in request
         self._error_checker(resp, resp_body)
       File "tempest/lib/common/rest_client.py", line 845, in _error_checker
         message=message)
     tempest.lib.exceptions.ServerFault: Got server fault
     Details: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
     <class 'nova.exception.ResourceProviderUpdateConflict'>
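
The generation check behaves like a compare-and-swap: a write only lands if the caller's view of the provider is still current. The sketch below illustrates that behaviour under the same illustrative names as before; in reality placement returns HTTP 409 and nova raises ResourceProviderUpdateConflict, for which a plain exception stands in here:

```python
class GenerationConflict(Exception):
    """Stand-in for the real ResourceProviderUpdateConflict."""


class FakePlacement:
    def __init__(self):
        self.aggregates = set()
        self.generation = 0

    def get_provider(self):
        # Readers receive the current generation along with the data.
        return set(self.aggregates), self.generation

    def set_aggregates_for_provider(self, aggregates, generation):
        # Compare-and-swap: reject writes based on a stale view; the
        # caller must re-read and retry.
        if generation != self.generation:
            raise GenerationConflict(
                'generation %d != %d' % (generation, self.generation))
        self.aggregates = set(aggregates)
        self.generation += 1


placement = FakePlacement()

napi_aggs, napi_gen = placement.get_provider()
ncpu_aggs, ncpu_gen = placement.get_provider()

# n-api writes first with a current generation: succeeds, gen 0 -> 1.
napi_aggs.add('agg-B')
placement.set_aggregates_for_provider(napi_aggs, napi_gen)

try:
    # n-cpu writes second with stale generation 0: fails loudly
    # instead of silently winning -- the behaviour surfaced by the
    # tempest failures above.
    placement.set_aggregates_for_provider(ncpu_aggs, ncpu_gen)
except GenerationConflict:
    # One possible resolution: re-read the provider and retry.
    ncpu_aggs, ncpu_gen = placement.get_provider()
    placement.set_aggregates_for_provider(ncpu_aggs, ncpu_gen)
```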

[1] https://review.openstack.org/#/c/553597/
[2] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1935
[3] https://github.com/openstack/nova/blob/ee7c39e4416e215d5bf5fbf07c0a8a4301828248/nova/compute/resource_tracker.py#L883
[4] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1956
[5] https://github.com/openstack/nova/blob/ee7c39e4416e215d5bf5fbf07c0a8a4301828248/nova/compute/resource_tracker.py#L897
[6] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1454
[7] https://review.openstack.org/#/c/556669/

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1779931

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1779931/+subscriptions

