yahoo-eng-team team mailing list archive
Message #80341
[Bug 1844721] Re: Need NUMA aware RAM reservation to avoid OOM killing host processes
Thanks to the fix for bug 532168, this issue can be addressed by setting
reserved_huge_pages for small pages per NUMA node in nova.conf, i.e. with
a configuration change only. Hence closing this ticket.
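For illustration only, here is a rough sketch of how such a nova.conf line might be derived. It assumes the host's small page size is 4 KiB, that reserved_huge_pages takes the form node:<id>,size:<KiB>,count:<pages>, and that 20 GiB should be held back on node 0 (the figure used in the bug description). The helper name reserved_small_pages is hypothetical; verify the option syntax against the nova documentation for your release.

```python
def reserved_small_pages(node: int, reserve_gib: int, page_kib: int = 4) -> str:
    """Build a reserved_huge_pages line that reserves small pages on one NUMA node.

    Assumption: sizes are expressed in KiB and the host uses 4 KiB small pages.
    """
    # Number of pages needed to cover the requested reservation.
    count = reserve_gib * 1024 * 1024 // page_kib
    return f"reserved_huge_pages = node:{node},size:{page_kib},count:{count}"


# Reserve 20 GiB of small pages on NUMA node 0.
print(reserved_small_pages(0, 20))
# prints: reserved_huge_pages = node:0,size:4,count:5242880
```

The option can presumably be repeated, one line per NUMA node, if more than one node needs a reservation.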
** Changed in: nova
Status: In Progress => New
** Changed in: nova
Status: New => Invalid
** Changed in: nova
Assignee: Jing Zhang (jing.zhang.nokia) => (unassigned)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844721
Title:
Need NUMA aware RAM reservation to avoid OOM killing host processes
Status in OpenStack Compute (nova):
Invalid
Bug description:
Description:
===========
CPU pinning is widely used in VNFs. When VM CPUs are pinned, there is
currently no way to reserve memory on NUMA 0 for host processes:
> ram_allocation_ratio is ignored by the nova scheduler when VM CPUs are pinned
> reserved_host_memory_mb is a global reservation: as long as memory is available globally (on any NUMA node), the VM is scheduled.
As a result, many VMs are scheduled on NUMA 0 (CPUs pinned to NUMA 0)
while their memory needs are met only "globally".
When the system starts to take load, the VMs' memory starts to be
allocated on NUMA 0 (because they are pinned to NUMA 0) to the extent
that a memory shortage occurs on NUMA 0 and the OOM killer kicks in to
kill host processes.
Many mitigations have been "invented", but they all have some form of
technical or operational "difficulties". One mitigation, for example,
is to enable huge pages and put the VMs on huge pages.
The right solution is for nova to support NUMA-aware RAM reservation,
as it does for the huge pages case, e.g.
reserved_host_memory=node:0, 20G
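To make the difference concrete, the toy model below contrasts a host-wide reservation check with a per-node one. It is not nova's scheduler code: the Node class, both functions, and all memory figures are hypothetical, chosen only to show how a VM can pass the global check while NUMA 0 is already too full to honour a 20G reservation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    total_mb: int
    used_mb: int = 0


def fits_globally(nodes, vm_mb, reserved_host_memory_mb):
    """Global check: only the host-wide sum of free memory matters."""
    free = sum(n.total_mb - n.used_mb for n in nodes) - reserved_host_memory_mb
    return vm_mb <= free


def fits_on_node(node, vm_mb, node_reserved_mb):
    """NUMA-aware check: the VM must fit on its own node after that node's reservation."""
    return vm_mb <= node.total_mb - node.used_mb - node_reserved_mb


# Two 64 GiB nodes; pinned VMs already use 48 GiB of NUMA 0, NUMA 1 is empty.
nodes = [Node(total_mb=65536, used_mb=49152), Node(total_mb=65536)]
vm_mb = 16384  # a new 16 GiB VM pinned to NUMA 0

print(fits_globally(nodes, vm_mb, reserved_host_memory_mb=20480))  # prints: True
print(fits_on_node(nodes[0], vm_mb, node_reserved_mb=20480))       # prints: False
```

With the global check the VM is admitted and NUMA 0 ends up fully consumed, leaving nothing for host processes; the per-node check rejects it on NUMA 0.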
Steps to reproduce
==================
Create CPU-pinned VMs. The VMs are crowded onto NUMA 0 until no more CPU cores are available there; only then are they scheduled on NUMA 1. Stress the system.
Expected result
===============
The system stays operational.
Actual result
=============
The OOM killer kicks in and kills host processes due to a lack of memory on NUMA 0, while there is plenty of free memory on NUMA 1.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1844721/+subscriptions