
yahoo-eng-team team mailing list archive

[Bug 1887377] [NEW] nova does not load-balance assignment of resources on a host based on availability of PCI devices, hugepages or pCPUs.

 

Public bug reported:

Nova has supported hugepages, CPU pinning and PCI NUMA affinity for a
very long time. Since their introduction, the advice has always been to
create flavors that mimic your typical hardware topology, i.e. if all
your compute hosts have 2 NUMA nodes, then you should create flavors that
request 2 NUMA nodes. For a long time operators have ignored this advice
and continued to create single NUMA node flavors, citing that after 5+
years of hardware vendors working with VNF vendors to make their products
NUMA aware, VNFs often still do not optimize properly for a multi-NUMA
environment.

As a result, many operators still deploy single NUMA VMs, although that
is becoming less common over time. When you deploy a VM with a single
NUMA node today, we more or less iterate over the host NUMA nodes in
order and assign the VM to the first NUMA node where it fits. On a host
without any PCI devices whitelisted for OpenStack management, this
behavior results in NUMA nodes being filled linearly from NUMA 0 to
NUMA n. That means if a host had 100G of hugepages on both NUMA node 0
and NUMA node 1, and you scheduled 101 single NUMA VMs with 1G of
hugepage memory each to the host, 100 VMs would spawn on NUMA node 0 and
1 VM would spawn on NUMA node 1.

That means the first 100 VMs would all contend for CPU resources on the
first NUMA node, while the last VM had all of the second NUMA node to
itself.
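The linear fill described above can be reproduced with a toy model. This is a minimal sketch, not nova's actual fitting code: `HostCell` and `first_fit` are hypothetical names, each cell only tracks free 1G hugepages, and each VM is assumed to need exactly one 1G page.

```python
from dataclasses import dataclass


@dataclass
class HostCell:
    """Toy host NUMA cell that only tracks free 1G hugepages (hypothetical)."""
    cell_id: int
    free_pages_1g: int


def first_fit(cells, vm_pages):
    """Claim pages from the first cell, in index order, that can fit the VM."""
    for cell in cells:
        if cell.free_pages_1g >= vm_pages:
            cell.free_pages_1g -= vm_pages
            return cell.cell_id
    return None  # no cell on this host can fit the VM


# Two NUMA nodes with 100G of 1G hugepages each; boot 101 single NUMA VMs
# that each need one 1G page.
cells = [HostCell(0, 100), HostCell(1, 100)]
placements = [first_fit(cells, 1) for _ in range(101)]
counts = {cell.cell_id: placements.count(cell.cell_id) for cell in cells}
print(counts)  # {0: 100, 1: 1}
```

Node 0 is drained completely before node 1 receives its first VM, which is exactly the 100/1 split described in the report.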

The correct behavior would be for nova to assign the VMs round robin,
attempting to keep the resource availability balanced. This will
maximise performance for individual VMs, at the cost of pessimising the
scheduling of large VMs on a host, since no single NUMA node is left
with a large block of free resources.
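A balanced placement can be sketched with the same toy model. Rather than a literal round robin, this hypothetical `spread_fit` picks the cell with the most free pages, which alternates between cells of equal size; as before, `HostCell` is a simplified stand-in and each VM is assumed to need one 1G page.

```python
from dataclasses import dataclass


@dataclass
class HostCell:
    """Toy host NUMA cell that only tracks free 1G hugepages (hypothetical)."""
    cell_id: int
    free_pages_1g: int


def spread_fit(cells, vm_pages):
    """Claim pages from the cell with the most free pages that can fit the VM."""
    candidates = [c for c in cells if c.free_pages_1g >= vm_pages]
    if not candidates:
        return None  # no cell on this host can fit the VM
    # max() returns the first maximal element, so ties go to the lowest index.
    best = max(candidates, key=lambda c: c.free_pages_1g)
    best.free_pages_1g -= vm_pages
    return best.cell_id


cells = [HostCell(0, 100), HostCell(1, 100)]
placements = [spread_fit(cells, 1) for _ in range(101)]
counts = {cell.cell_id: placements.count(cell.cell_id) for cell in cells}
print(counts)  # {0: 51, 1: 50}
```

The trade-off mentioned above shows up immediately: after these 101 VMs the cells have 49 and 50 free pages, so a single NUMA VM needing 60 1G pages no longer fits anywhere, whereas the first-fit layout would still have 99 free pages on node 1.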

To this end, a new NUMA balancing config option (unset, pack or spread)
should be added, and we should sort NUMA nodes in descending (spread) or
ascending (pack) order based on the availability of pMEM, pCPUs,
mempages and PCI devices, in that sequence.
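The proposed ordering maps naturally onto a tuple sort key. This is a minimal sketch under stated assumptions: `NumaBalancing`, `HostCell`, `free_resources` and `order_cells` are hypothetical names (the report does not name the option), and each cell is reduced to four free-resource counters. Because Python compares tuples element by element, pMEM dominates, then pCPUs, then mempages, then PCI devices.

```python
from dataclasses import dataclass
from enum import Enum


class NumaBalancing(Enum):
    """Values of the proposed option; the class name itself is hypothetical."""
    UNSET = "unset"
    PACK = "pack"
    SPREAD = "spread"


@dataclass
class HostCell:
    """Hypothetical view of the free resources in one host NUMA cell."""
    cell_id: int
    free_pmem: int      # free persistent memory (pMEM)
    free_pcpus: int     # free dedicated (pinned) CPUs
    free_mempages: int  # free hugepages
    free_pci: int       # free PCI devices affined to this cell


def free_resources(cell):
    # Tuples compare element by element, giving the priority
    # pMEM > pCPUs > mempages > PCI devices.
    return (cell.free_pmem, cell.free_pcpus, cell.free_mempages, cell.free_pci)


def order_cells(cells, policy):
    """Return host cells in the order the fitting code should try them."""
    if policy is NumaBalancing.UNSET:
        return list(cells)  # keep today's behaviour: plain host cell order
    # sorted() is stable (also with reverse=True), so cells with identical
    # free resources keep their host order under either policy.
    return sorted(cells, key=free_resources,
                  reverse=policy is NumaBalancing.SPREAD)


cells = [HostCell(0, 0, 8, 10, 0), HostCell(1, 0, 4, 10, 2),
         HostCell(2, 0, 8, 20, 1)]
for policy in NumaBalancing:
    print(policy.value, [c.cell_id for c in order_cells(cells, policy)])
# unset [0, 1, 2]
# pack [1, 0, 2]
# spread [2, 0, 1]
```

Cell 1 has the most free PCI devices, yet it is tried first under pack because pCPUs are compared before PCI devices in the proposed sequence.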

In a future release, when NUMA is modelled in placement, this sorting
will need to be done in a weigher that sorts the allocation candidates
based on the same pack/spread criteria.
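Since placement does not yet model NUMA, any such weigher is speculative. The sketch below is purely hypothetical: `Candidate`, `free_key` and `weigh` are illustrative names, and turning the sort key into a rank-based weight in [0, 1] is an assumption rather than nova's mechanism. The sign convention borrows from nova's existing weighers such as `ram_weight_multiplier`, where negative values stack rather than spread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical allocation candidate: a host plus the NUMA cell it would use."""
    host: str
    cell_id: int
    free_pmem: int
    free_pcpus: int
    free_mempages: int
    free_pci: int


def free_key(c):
    # Same priority as the proposed sort: pMEM, then pCPUs, mempages, PCI devices.
    return (c.free_pmem, c.free_pcpus, c.free_mempages, c.free_pci)


def weigh(candidates, multiplier):
    """Return (weight, candidate) pairs, highest weight first.

    Each candidate's weight is its rank by free resources, scaled to [0, 1]
    and multiplied by ``multiplier``: > 0 favours the emptiest cell (spread),
    < 0 favours the fullest cell (pack). Candidates with equal free
    resources still get distinct ranks here, a simplification.
    """
    ordered = sorted(candidates, key=free_key)  # least free first
    span = max(len(ordered) - 1, 1)
    weighted = [(multiplier * rank / span, c) for rank, c in enumerate(ordered)]
    return sorted(weighted, key=lambda pair: pair[0], reverse=True)


candidates = [Candidate("host-a", 0, 0, 8, 10, 0),
              Candidate("host-a", 1, 0, 4, 10, 2),
              Candidate("host-b", 0, 0, 8, 20, 1)]
best_spread = weigh(candidates, 1.0)[0][1]
best_pack = weigh(candidates, -1.0)[0][1]
print(best_spread.host, best_spread.cell_id)  # host-b 0
print(best_pack.host, best_pack.cell_id)      # host-a 1
```

Expressing pack versus spread as the sign of a multiplier keeps a single weigher usable for both policies, in the same way operators already flip `ram_weight_multiplier` to switch between spreading and stacking.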

I am filing this as a bug, not a feature, as this will have a
significant impact on existing deployments that either expected
https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html
to implement this logic already, or that do not follow our existing
guidance on creating flavors that align to the host topology.

** Affects: nova
     Importance: Undecided
     Assignee: sean mooney (sean-k-mooney)
         Status: New


** Tags: numa

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1887377

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1887377/+subscriptions

