yahoo-eng-team team mailing list archive
Message #74185
[Bug 1786055] [NEW] performance degradation in placement with large number of resource providers
Public bug reported:
Using today's master, there is a big performance degradation in GET
/allocation_candidates when there is a large number of resource
providers (in my tests 1000, each with the same inventory as described
in [1]). The request takes around 17s when querying all three resource
classes with
http://127.0.0.1:8081/allocation_candidates?resources=VCPU:1,MEMORY_MB:256,DISK_GB:10
Using a limit does not make any difference; the cost is in generating
the original data.
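Something like the following stdlib-only snippet is enough to reproduce the
measurement against a placement loaded with the providers from [1] (the
x-auth-token value and the microversion header are assumptions about a
local noauth/devstack-style setup, not part of the report):
-=-=-
import time
import urllib.request

URL = ('http://127.0.0.1:8081/allocation_candidates'
       '?resources=VCPU:1,MEMORY_MB:256,DISK_GB:10')
# Placeholder headers for a local test deployment.
req = urllib.request.Request(URL, headers={
    'x-auth-token': 'admin',
    'openstack-api-version': 'placement latest',
    'accept': 'application/json',
})
start = time.time()
with urllib.request.urlopen(req) as resp:
    body = resp.read()
print('status=%s bytes=%d elapsed=%.1fs'
      % (resp.status, len(body), time.time() - start))
-=-=-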
I did some advanced LOG.debug-based benchmarking to determine three
places where things are a problem, and maybe even fixed the worst one.
See the diff below. The two main culprits are
ResourceProvider.get_by_uuid calls made while looping over the full set
of providers. These can be replaced either by reusing data we already
have from earlier queries, or by changing the code so we make single
queries instead of one per provider.
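To make the "single queries" option concrete: the idea is to fetch every
provider we will need in one IN() query up front and then look results up
from a dict, instead of one get_by_uuid round trip per provider. A minimal,
self-contained sketch of that pattern (plain sqlite3 with an illustrative
schema, not the real placement models or API):
-=-=-
import sqlite3

def fetch_providers_by_uuids(conn, uuids):
    """Return {uuid: (id, uuid, name)} with a single round trip."""
    uuids = list(uuids)
    placeholders = ','.join('?' for _ in uuids)
    rows = conn.execute(
        'SELECT id, uuid, name FROM resource_providers '
        'WHERE uuid IN (%s)' % placeholders, uuids).fetchall()
    return {row[1]: row for row in rows}

# Build a toy table with 1000 providers, then resolve a handful of uuids in
# one query and reuse the map wherever get_by_uuid would have been called.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE resource_providers (id INTEGER, uuid TEXT, name TEXT)')
conn.executemany('INSERT INTO resource_providers VALUES (?, ?, ?)',
                 [(i, 'uuid-%d' % i, 'rp-%d' % i) for i in range(1000)])
providers = fetch_providers_by_uuids(conn, {'uuid-1', 'uuid-500', 'uuid-999'})
print(providers['uuid-500'])
-=-=-
Whether it ends up as raw SQL or a batched object load, the point is the
same: one query for N providers rather than N queries.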
In the diff I've already changed one of them (the second chunk) to use
the data that _build_provider_summaries is already getting (functional
tests still pass with this change).
The third chunk is slow simply because we have a big loop, but I
suspect there is some duplication that can be avoided. I have not
investigated that closely (yet).
-=-=-
diff --git a/nova/api/openstack/placement/objects/resource_provider.py b/nova/api/openstack/placement/objects/resource_provider.py
index 851f9719e4..e6c894b8fe 100644
--- a/nova/api/openstack/placement/objects/resource_provider.py
+++ b/nova/api/openstack/placement/objects/resource_provider.py
@@ -3233,6 +3233,8 @@ def _build_provider_summaries(context, usages, prov_traits):
         if not summary:
             summary = ProviderSummary(
                 context,
+                # This is _expensive_ when there are a large number of rps.
+                # Building the objects differently may be better.
                 resource_provider=ResourceProvider.get_by_uuid(context,
                                                                uuid=rp_uuid),
                 resources=[],
@@ -3519,8 +3521,7 @@ def _alloc_candidates_multiple_providers(ctx, requested_resources,
         rp_uuid = rp_summary.resource_provider.uuid
         tree_dict[root_id][rc_id].append(
             AllocationRequestResource(
-                ctx, resource_provider=ResourceProvider.get_by_uuid(ctx,
-                                                                    rp_uuid),
+                ctx, resource_provider=rp_summary.resource_provider,
                 resource_class=_RC_CACHE.string_from_id(rc_id),
                 amount=requested_resources[rc_id]))
@@ -3535,6 +3536,8 @@ def _alloc_candidates_multiple_providers(ctx, requested_resources,
     alloc_prov_ids = []
 
     # Let's look into each tree
+    # With many resource providers this takes a long time, but each trip
+    # through the loop is not too bad.
     for root_id, alloc_dict in tree_dict.items():
         # Get request_groups, which is a list of lists of
         # AllocationRequestResource(ARR) per requested resource class(rc).
-=-=-
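As an aside, the kind of LOG.debug-based benchmarking referred to above can
be as simple as wrapping the suspect sections in a timer and reading the
debug log; a generic sketch of that sort of instrumentation (the logger
name and labels here are arbitrary, not the exact instrumentation used):
-=-=-
import contextlib
import logging
import time

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger(__name__)

@contextlib.contextmanager
def timed(label):
    # Log how long the wrapped block took; coarse, but enough to find
    # which sections dominate the request time.
    start = time.time()
    yield
    LOG.debug('%s took %.3fs', label, time.time() - start)

# Usage: wrap a suspect block (e.g. building the ProviderSummary objects or
# the per-tree loop) and compare the timings in the debug output.
with timed('stand-in for the expensive section'):
    time.sleep(0.01)
-=-=-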
[1] https://github.com/cdent/placeload/blob/master/placeload/__init__.py#L23
** Affects: nova
Importance: High
Status: Confirmed
** Tags: placement
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1786055
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1786055/+subscriptions