yahoo-eng-team team mailing list archive

[Bug 1978372] Re: numa_fit_instance_to_host() algorithm is highly ineffective on higher number of NUMA nodes

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/845896
Committed: https://opendev.org/openstack/nova/commit/099a6f63af7805440d91976ba0ea03bc6c278280
Submitter: "Zuul (22348)"
Branch:    master

commit 099a6f63af7805440d91976ba0ea03bc6c278280
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date:   Wed Jun 15 09:28:27 2022 +0200

    Optimize numa_fit_instance_to_host
    
    The numa_fit_instance_to_host algorithm tries all the possible
    host cell permutations to fit the instance cells. So in the worst case
    it makes n! / (n-k)! _numa_fit_instance_cell calls
    (n=len(host_cells), k=len(instance_cells)) to find out whether the
    instance fits on the host. With a 16 NUMA node host and an 8 NUMA node
    guest this means about 500 million calls to _numa_fit_instance_cell.
    This takes excessive time.
    
    However, while going through these permutations the same
    host_cell, instance_cell pairs are tried repeatedly.
    E.g.
      host_cells=[H1, H2, H3]
      instance_cells=[G1, G2]
    
    Produces pairings:
    
    * H1 <- G1 and H2 <- G2
    * H1 <- G1 and H3 <- G2
    ...
    
    Here G1 is checked against H1 twice. But if it does not fit the first
    time then we know that it will not fit the second time either. So we
    can cache the result of the first check and use that cache for the later
    permutations.
    
    This patch adds two caches to the algorithm: a fit_cache to hold the
    host_cell.id, instance_cell.id pairs that we know fit, and a
    no_fit_cache for those pairs that we already know do not fit.
    
    This change significantly boosts the performance of the algorithm. The
    reproduction provided in bug 1978372 took 6 minutes to run on my local
    machine without the optimization. With the optimization it runs in
    3 seconds.
    
    This change increases the memory usage of the algorithm by the two
    caches. Those caches are sets of two-tuples of integers. The total size
    of the caches is bounded by the number of possible host_cell,
    instance_cell pairs, which is len(host_cells) * len(instance_cells). So
    for the above example (16 host, 8 instance NUMA cells) it is at most 128
    pairs of integers. That will not cause a significant memory increase.
    
    Closes-Bug: #1978372
    Change-Id: Ibcf27d741429a239d13f0404348c61e2668b4ce4
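
The caching idea described in the commit message above can be illustrated with a minimal, standalone sketch. This is not the Nova code: `fit_with_caches` and the pairwise `fits` callback are hypothetical names, cells are reduced to plain ids, and the sketch assumes that whether one instance cell fits one host cell does not depend on the other pairings.

```python
from itertools import permutations


def fit_with_caches(host_cells, instance_cells, fits):
    """Return the first host-cell assignment that fits, or None.

    host_cells / instance_cells are lists of hashable ids (stand-ins for
    host_cell.id and instance_cell.id). fits(host, guest) is a
    hypothetical, possibly expensive pairwise check.
    """
    fit_cache = set()     # (host id, instance id) pairs known to fit
    no_fit_cache = set()  # (host id, instance id) pairs known not to fit

    for perm in permutations(host_cells, len(instance_cells)):
        for pair in zip(perm, instance_cells):
            if pair in no_fit_cache:
                break  # this permutation cannot work, skip it
            if pair in fit_cache:
                continue  # already known to fit, no need to re-check
            if fits(*pair):
                fit_cache.add(pair)
            else:
                no_fit_cache.add(pair)
                break
        else:
            # Every instance cell found a host cell in this permutation.
            return list(zip(perm, instance_cells))
    return None


# Toy usage: host cells 1 and 2 are "occupied", 3 and 4 are free.
calls = []


def fits(host, guest):
    calls.append((host, guest))  # record each real (uncached) check
    return host >= 3


result = fit_with_caches([1, 2, 3, 4], ["G1", "G2"], fits)
print(result, len(calls))
```

In this toy run every (host, guest) pair is checked at most once, so the number of real checks is bounded by len(host_cells) * len(instance_cells) even though the loop may still walk many permutations.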

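The figures quoted in the commit message can be checked with a few lines of Python (a quick arithmetic check only; `math.perm` needs Python 3.8 or later, and the variable names are illustrative):

```python
import math

n = 16  # host NUMA cells
k = 8   # instance NUMA cells

# Number of ordered selections of k host cells out of n: n! / (n - k)!
worst_case_permutations = math.perm(n, k)

# Upper bound on the number of (host_cell.id, instance_cell.id) pairs
# that the two caches can hold together.
max_cached_pairs = n * k

print(worst_case_permutations)  # 518918400, i.e. roughly 500 million
print(max_cached_pairs)         # 128
```
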

** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1978372

Title:
  numa_fit_instance_to_host() algorithm is highly ineffective on higher
  number of NUMA nodes

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  The Nova scheduler takes ~5 minutes to execute numa_fit_instance_to_host() for an instance with 8 NUMA nodes against a host whose NUMA topology includes 16 NUMA nodes (3 cores × 2 threads each) when the first half of the host's NUMA nodes are occupied.

  This makes scheduling a 48-core flavor extremely slow.

  Output of reproducer:
  ```
  InstanceNUMATopology(cells=[InstanceNUMACell(8),InstanceNUMACell(9),InstanceNUMACell(10),InstanceNUMACell(11),InstanceNUMACell(12),InstanceNUMACell(13),InstanceNUMACell(14)],emulator_threads_policy=None,id=<?>,instance_uuid=<?>)

  ________________________________________________________
  Executed in  269.13 secs    fish           external
     usr time  268.60 secs    0.00 micros  268.60 secs
     sys time    0.07 secs  595.00 micros    0.07 secs
  ```

  Steps to reproduce
  ==================
  1. Add a host with 16 NUMA nodes (3 cores × 2 threads each) to the OpenStack deployment.
  2. Create a 48-CPU flavor that takes exactly half of the host:
  openstack flavor create sh4a-c48r488e20 \
  --ram $((488*1024)) \
  --vcpus 48 \
  --ephemeral 20 \
  --disk 20 \
  --swap 0 \
  --property 'hw:mem_page_size=1GB' \
  --property 'hw:cpu_policy=dedicated' \
  --property 'hw:cpu_thread_policy=prefer' \
  --property 'hw:cpu_max_sockets=8' \
  --property 'hw:cpu_sockets=8' \
  --property 'hw:numa_mempolicy=strict' \
  --property 'hw:numa_nodes=8' \
  --property 'hw:numa_cpus.0=0,1,2,3,4,5' \
  --property 'hw:numa_cpus.1=6,7,8,9,10,11' \
  --property 'hw:numa_cpus.2=12,13,14,15,16,17' \
  --property 'hw:numa_cpus.3=18,19,20,21,22,23' \
  --property 'hw:numa_cpus.4=24,25,26,27,28,29' \
  --property 'hw:numa_cpus.5=30,31,32,33,34,35' \
  --property 'hw:numa_cpus.6=36,37,38,39,40,41' \
  --property 'hw:numa_cpus.7=42,43,44,45,46,47' \
  --property 'hw:numa_mem.0=62464' \
  --property 'hw:numa_mem.1=62464' \
  --property 'hw:numa_mem.2=62464' \
  --property 'hw:numa_mem.3=62464' \
  --property 'hw:numa_mem.4=62464' \
  --property 'hw:numa_mem.5=62464' \
  --property 'hw:numa_mem.6=62464' \
  --property 'hw:numa_mem.7=62464' \
  --property 'hw:cpu_threads=2' \
  --property 'hw:cpu_max_threads=2'
  3. Create an instance with this flavor (so that it lands on that host). The command is omitted because it differs between installations.
  4. Wait for the first instance to spawn (this part is fast as it takes the first 8 NUMA nodes).
  5. Create a second instance with the same flavor.

  …

  Wait 5+ minutes until nova-scheduler is done with its work.

  Expected result
  ===============
  NUMA nodes selected within 10-15 seconds.

  Actual result
  =============
  The algorithm is so slow that it takes 5 minutes to get the instance scheduled.

  Environment
  ===========
  1. OpenStack Nova 23.2.0-1.el8. NOTE: I am able to reproduce this on the master branch with a 20-line reproducer.
  commit 4939318649650b60dd07d161b80909e70d0e093e (HEAD -> master, upstream/master)
  Merge: c6e0f4f551 4c339c10e3
  Author: Zuul <zuul@xxxxxxxxxxxxxxxxxx>
  Date:   Tue May 17 00:01:41 2022 +0000

      Merge "Drop lower-constraints.txt and its testing"

  2. Libvirt + KVM (although it is not relevant here)
  libvirt-8.0.0-6.module_el8.7.0+1140+ff0772f9.x86_64
  qemu-kvm-6.2.0-12.module_el8.7.0+1140+ff0772f9.x86_64

  3. LVM storage (not relevant either)
  lvm2-2.03.14-3.el8.x86_64

  4. Neutron with L2 (not relevant)

  Logs & Configs
  ==============
  Check the reproducer and try it with uncommented DEBUG lines (will attach it here too).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1978372/+subscriptions


