
yahoo-eng-team team mailing list archive

[Bug 1844721] Re: Need NUMA aware RAM reservation to avoid OOM killing host processes

 

Thanks to the fix for bug 532168, this issue can be addressed by setting
reserved_huge_pages for small pages per NUMA node in nova.conf, i.e. by a
configuration change only. Closing this ticket.
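
As a rough illustration of the configuration change, the sketch below
computes a per-node small-page reservation. It assumes the
`node:<id>,size:<KiB>,count:<pages>` value format documented for
reserved_huge_pages, a 4 KiB small page size, and borrows the 20G figure
for NUMA 0 from the bug description. The helper name is hypothetical;
check the value format against the nova release in use.

```python
def reserved_small_pages_option(node: int, reserve_gib: int, page_kib: int = 4) -> str:
    """Build a reserved_huge_pages value that reserves `reserve_gib` GiB
    of small pages on one NUMA node (assumed format, see lead-in)."""
    # Convert GiB to KiB, then divide by the page size to get a page count.
    count = reserve_gib * 1024 * 1024 // page_kib
    return f"node:{node},size:{page_kib},count:{count}"


# 20 GiB on NUMA node 0, the figure used in the bug description.
print("reserved_huge_pages = " + reserved_small_pages_option(0, 20))
```

The printed line would go into nova.conf on the compute host; one such
line is needed per NUMA node that should keep memory back for host
processes.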

** Changed in: nova
       Status: In Progress => New

** Changed in: nova
       Status: New => Invalid

** Changed in: nova
     Assignee: Jing Zhang (jing.zhang.nokia) => (unassigned)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844721

Title:
  Need NUMA aware RAM reservation to avoid OOM killing host processes

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description:
  ===========

  CPU pinning is widely used in VNFs. When a VM's CPUs are pinned, there
  is currently no way to reserve memory on NUMA 0 for host processes:

  > ram_allocation_ratio is ignored by the nova scheduler when VM CPU is pinned
  > reserved_host_memory_mb is a global reservation: as long as there is memory available globally (on any NUMA node), the VM is scheduled.

  As a result, many VMs are scheduled on NUMA 0 (CPUs pinned to NUMA 0)
  while their memory needs are only met "globally".

  When the system starts to take load, the VMs' memory is allocated on
  NUMA 0 (because they are pinned to NUMA 0) to the extent that a memory
  shortage occurs on NUMA 0 and the OOM killer kicks in to kill host
  processes.
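
  The gap between a global and a per-node check can be shown with a toy
  model. This is only a sketch, not nova's scheduler code: the class,
  the function names, and all memory figures are made up for
  illustration.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    total_mb: int
    vm_mb: int = 0  # memory of guests whose CPUs are pinned to this node


def fits_globally(nodes, request_mb, reserved_host_mb):
    """Global check: only the sum of free memory across all nodes counts."""
    free = sum(n.total_mb - n.vm_mb for n in nodes)
    return free - reserved_host_mb >= request_mb


def fits_on_node(node, request_mb, reserved_node_mb):
    """Per-node check: the reservation is held back on this node itself."""
    return node.total_mb - node.vm_mb - reserved_node_mb >= request_mb


# NUMA 0 is nearly full of pinned guests, NUMA 1 is empty.
nodes = [NumaNode(total_mb=65536, vm_mb=61440), NumaNode(total_mb=65536)]
request_mb = 4096

print(fits_globally(nodes, request_mb, reserved_host_mb=20480))    # True
print(fits_on_node(nodes[0], request_mb, reserved_node_mb=20480))  # False
```

  The global check accepts the request because NUMA 1 has free memory,
  even though a guest pinned to NUMA 0 would eat into the memory the
  host needs there.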

  Many mitigations have been "invented", but they all have some form of
  technical or operational "difficulties". One mitigation, for example,
  is to enable huge pages and put VMs on huge pages.

  The right solution is for nova to support NUMA-aware RAM reservation,
  as it does for the huge pages case, i.e.

  reserved_host_memory=node:0, 20G

  
  Steps to reproduce
  ==================
  Create CPU-pinned VMs. The VMs are crowded onto NUMA 0 until no more CPU cores are available there; only then are they scheduled on NUMA 1. Stress the system.

  Expected result
  ===============

  The system stays operational.

  Actual result
  =============
  The OOM killer kicks in and kills host processes due to lack of memory on NUMA 0, while there is plenty of free memory on NUMA 1.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1844721/+subscriptions

