
yahoo-eng-team team mailing list archive

[Bug 1893121] [NEW] nova does not balance vm across numa node or prefer numa node with pci device when one is requested

 

Public bug reported:

The current implementation of NUMA in nova has evolved over the years to
support PCI affinity policies and NUMA affinity for other devices such as
PMEM.

When NUMA support was first introduced, the recommendation was to match
the virtual NUMA topology of a guest to the NUMA topology of the host for
best performance.

In such a configuration the guest CPUs and memory are evenly distributed
across the host NUMA nodes, meaning that the memory controllers and
physical CPUs are consumed evenly, i.e. all VMs do not use the cores from
only one host NUMA node.

However, if you create a VM with only hw:numa_nodes set and no other NUMA
requests, then due to how we currently iterate over host NUMA cells in a
deterministic order, all VMs will be placed on NUMA node 0.

If other VMs also request NUMA resources, such as pinned CPUs
(hw:cpu_policy=dedicated) or an explicit page size
(hw:mem_page_size=<small|large|any|###>), then the consumption of those
resources will eventually cause those VMs to be load-balanced onto the
other NUMA nodes.
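
For concreteness, these are the kinds of flavor extra specs involved (a
hypothetical set of values, shown here only to illustrate the cases
described above):

    # Illustrative flavor extra specs only (values are examples).
    numa_only = {"hw:numa_nodes": "1"}            # NUMA topology only: today always lands on node 0
    cpu_pinned = {"hw:numa_nodes": "1",
                  "hw:cpu_policy": "dedicated"}   # pinned CPUs: spread only as node 0 fills up
    hugepages = {"hw:numa_nodes": "1",
                 "hw:mem_page_size": "large"}     # explicit page size: same fill-first behaviour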

As a result, the current behavior for NUMA VMs using CPU pinning or
hugepages is to fill the first NUMA node before ever using resources from
the rest, while NUMA VMs that only request hw:numa_nodes won't be
load-balanced at all.

In both cases this is suboptimal, as it results in lower utilisation of
the host hardware: the second and subsequent NUMA nodes will not be used
until the first NUMA node is full when using pinning and hugepages, and
will never be used for NUMA instances that don't request other NUMA
resources.

In a similar vein,

https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html

partly implemented a preferential sorting of host NUMA cells with PCI
devices: if the VM did not request a PCI device, we weight host cells
that have PCI devices lower than host cells without them.

https://github.com/openstack/nova/blob/20459e3e88cb8382d450c7fdb042e2016d5560c5/nova/virt/hardware.py#L2268-L2275


A full implementation would, on the selected host, also prefer placing
the VM on NUMA nodes that have a PCI device when one is requested.

As a result, if a host has 2 NUMA nodes and the VM requests a PCI device
and 1 NUMA node, and the VM fits on the first NUMA node (node0) with the
preferred PCI affinity policy, we won't check or use the second NUMA node
(node1).


The fix for this is trivial: add an else clause.

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore

becomes

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore
    else:
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools], reverse=True)  # type: ignore
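
For context on why this flips the preference: the sort key is a boolean
(whether the cell's id appears in a PCI pool) and False sorts before True,
so the original branch tries PCI-free cells first while the new
reverse=True branch tries cells with a PCI device first. A standalone
illustration (the cell ids and pool contents here are made up):

    >>> pci_pool_nodes = [0]  # hypothetical: only NUMA node 0 has a PCI device
    >>> # no PCI device requested: the PCI-free cell (id 1) sorts first
    >>> sorted([0, 1], key=lambda cell_id: cell_id in pci_pool_nodes)
    [1, 0]
    >>> # PCI device requested: reverse=True puts the cell with the device first
    >>> sorted([0, 1], key=lambda cell_id: cell_id in pci_pool_nodes, reverse=True)
    [0, 1]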
 
Or, more compactly:


    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached; if they are required, prefer host cells that do.
    # Presence of a given numa_node in a PCI pool is indicative of a PCI
    # device being associated with that node
    reverse = bool(pci_requests and pci_stats)
    # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
    # that out for some reason
    host_cells = sorted(host_cells, key=lambda cell: cell.id in [
        pool['numa_node'] for pool in pci_stats.pools], reverse=reverse)  # type: ignore

Since Python's sort is stable, complex sort orders can be achieved by
chaining multiple sorts:

https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts

So we can also address the NUMA balancing issue by first sorting by
instances per NUMA node, then by free memory per NUMA node, then by free
CPUs per NUMA node, and finally by PCI devices per NUMA node.
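
As a rough, self-contained sketch of that chained stable sort (the Cell
fields below are illustrative placeholders rather than nova's actual
NUMACell attributes, and the key priority shown, instance count as the
most significant key with the PCI preference as the final tie-breaker, is
just one reading of the ordering above):

    # Minimal sketch only: Cell and its fields are made-up placeholders,
    # not nova's NUMACell object.
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Cell:
        id: int
        num_instances: int  # instances already placed on this cell
        free_memory: int    # free memory in MiB
        free_cpus: int      # free (unpinned) CPUs
        has_pci: bool       # True if a PCI device is attached to this cell


    def sort_host_cells(cells: List[Cell], pci_requested: bool) -> List[Cell]:
        # Python's sort is stable, so apply the least significant key first
        # and the most significant key last.
        # Tie-breaker: prefer cells with a PCI device only when one is
        # requested, otherwise prefer cells without one (existing behaviour).
        cells = sorted(cells, key=lambda c: c.has_pci, reverse=pci_requested)
        # Prefer cells with more free CPUs ...
        cells = sorted(cells, key=lambda c: c.free_cpus, reverse=True)
        # ... then cells with more free memory ...
        cells = sorted(cells, key=lambda c: c.free_memory, reverse=True)
        # ... and, most significant, cells with the fewest instances, so VMs
        # that only set hw:numa_nodes are spread instead of piling onto node 0.
        return sorted(cells, key=lambda c: c.num_instances)

How each per-cell metric is obtained inside nova is left open here; the
point is only that chaining stable sorts keeps each earlier ordering as a
tie-breaker for the later, more significant ones.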

This will allow nova to distribute VMs evenly and optimally across NUMA
nodes and also fully support the preference aspect of the preferred
SR-IOV NUMA affinity policy, which currently only selects a host capable
of providing NUMA affinity but does not actually prefer the NUMA node
with the device when we boot the VM.

This bug applies to all currently supported releases of nova.

** Affects: nova
     Importance: Undecided
     Assignee: sean mooney (sean-k-mooney)
         Status: Confirmed


** Tags: libvirt numa pci scheduler

** Summary changed:

- nova does not blance vm aross numa node or prefer numa node with pci device when one is requested
+ nova does not balance vm across numa node or prefer numa node with pci device when one is requested

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1893121

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1893121/+subscriptions

