yahoo-eng-team team mailing list archive
Message #86948
[Bug 1940668] [NEW] Nova-compute fits instance to NUMA nodes non-optimally, resulting in instance creation failure
Public bug reported:
Description
===========
Reproduced on ussuri; master has the same code.
When nova-compute starts fitting an instance's NUMA topology onto the host's
NUMA topology, it uses the host cells list. This list contains cell objects
from cell 0 up to cell N, always sorted by cell id from 0 to N (N depends on
the number of host NUMA nodes). The only case in which the sort order of this
list changes is an instance without a PCI device requirement: if the instance
does not need a PCI device tied to a specific NUMA node, the host cells list
is reordered to place cells with PCI capabilities at the end of the list. If
all NUMA cells have PCI capabilities, the list order does not change.
This behaviour means the instance's first NUMA node is always tried against
host NUMA node id 0 first.
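To make the ordering concrete, here is a minimal sketch of the behaviour
described above. It is not the actual nova code; HostCell, has_pci and
instance_needs_pci are illustrative names only.

    from collections import namedtuple

    HostCell = namedtuple('HostCell', ['id', 'has_pci'])

    def order_host_cells(host_cells, instance_needs_pci):
        """Return host cells in the order they will be tried for fitting."""
        if instance_needs_pci:
            # The instance asks for NUMA-affine PCI devices: keep id order.
            return list(host_cells)
        # No PCI requirement: a stable sort pushes cells that own PCI
        # devices to the end while keeping id order inside each group.
        return sorted(host_cells, key=lambda cell: cell.has_pci)

    # When every cell has PCI devices the order is unchanged, so cell 0 is
    # always the first candidate:
    cells = [HostCell(0, True), HostCell(1, True)]
    print(order_host_cells(cells, instance_needs_pci=False))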
If huge pages are used and several instances with fewer NUMA nodes than the
host has are placed, NUMA node id 0 is exhausted completely. Instances with a
larger number of NUMA nodes (for example, an instance whose NUMA node count
equals the host's) then fail to fit on this host.
To mitigate this issue, it would be better to take NUMA node memory usage
into account when ordering the host cells.
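One possible shape for that mitigation, as a sketch only and not a proposed
patch; HostCell and its free_mem field are illustrative stand-ins for
whatever per-cell usage data nova tracks internally.

    from collections import namedtuple

    HostCell = namedtuple('HostCell', ['id', 'has_pci', 'free_mem'])

    def order_host_cells_by_usage(host_cells, instance_needs_pci):
        """Try the least-used cells first instead of always starting at id 0."""
        if instance_needs_pci:
            # PCI-affine instances: simply prefer the cells with the most
            # free memory.
            return sorted(host_cells, key=lambda cell: -cell.free_mem)
        # Otherwise keep "cells with PCI devices go last" as the primary
        # rule and use free memory as the tie-breaker, so single-NUMA-node
        # instances land on the emptiest cell instead of piling up on cell 0.
        return sorted(host_cells,
                      key=lambda cell: (cell.has_pci, -cell.free_mem))

    # With node 0 nearly full and node 1 empty, node 1 is now tried first:
    cells = [HostCell(0, False, 1024), HostCell(1, False, 12288)]
    print(order_host_cells_by_usage(cells, instance_needs_pci=False))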
Possibly also related to https://bugs.launchpad.net/nova/+bug/1738501
Steps to reproduce
==================
1. Configure OpenStack to use 2MB huge pages and allocate huge pages on a compute host (let's say compute 1) at boot.
For ussuri this is described here: https://docs.openstack.org/nova/ussuri/admin/huge-pages.html
2. Prepare two flavors to test the issue: one flavor with hw:mem_page_size='2MB', hw:numa_nodes='1',
and a second flavor with hw:mem_page_size='2MB', hw:numa_nodes='N', where N is the number of NUMA nodes on the compute host used for testing. The compute host should have more than one NUMA node.
The flavors' RAM should be large enough to exhaust the RAM of NUMA node 0 on the compute host with a small number of instances; let's say 6 instances of flavor 1 exhaust NUMA node 0 RAM. Flavor 2 RAM should equal flavor 1 RAM multiplied by N (the number of NUMA nodes on compute 1). See the worked example after step 4.
3. Start 6 instances with the first flavor (1 NUMA node defined) on compute 1 (with an availability zone hint pointing to compute 1). The RAM of NUMA node 0 on compute 1 will be exhausted.
4. Try to start an instance with the second flavor. The instance will fail to start with the error "...was re-scheduled: Insufficient compute resources: Requested instance NUMA topology cannot fit the given host NUMA topology".
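Worked example (illustrative numbers only): suppose compute 1 has 2 NUMA
nodes with 12 GB of 2MB huge pages each, flavor 1 has 2 GB RAM and flavor 2
has 4 GB RAM (N=2, so 2 GB per NUMA node). After the six flavor-1 instances
all land on node 0, node 0 has 0 GB of huge pages left while node 1 still has
12 GB free. The flavor-2 instance needs 2 GB on each of two different host
nodes, so it cannot fit even though the host as a whole still has 12 GB of
free huge pages.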
How it should work
==================
We should take the memory usage of NUMA nodes into account to reduce the
number of errors of this kind.
** Affects: nova
Importance: Undecided
Status: New