
yahoo-eng-team team mailing list archive

[Bug 1816454] Re: hw:mem_page_size is not respecting all documented values


Looks like this was resolved in https://review.opendev.org/#/c/673252/

** Changed in: nova
       Status: New => Fix Released

** Tags added: doc

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816454

Title:
  hw:mem_page_size is not respecting all documented values

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Per the Rocky documentation for hugepages:
  https://docs.openstack.org/nova/rocky/admin/huge-pages.html

  2MB hugepages can be specified either as:
  --property hw:mem_page_size=2Mb, or
  --property hw:mem_page_size=2048
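  For reference, a minimal sketch (a hypothetical helper, not Nova's
  actual parser) of how these documented spellings could be normalized
  to a page size in KiB; the real logic lives inside Nova and differs
  in detail:

```python
# Hypothetical normalizer for hw:mem_page_size values, illustrating the
# behaviour the Rocky documentation implies: symbolic sizes pass through,
# bare numbers are KiB, and unit suffixes are case-insensitive.
UNIT_KIB = {"kb": 1, "mb": 1024, "gb": 1024 * 1024}

def normalize_page_size(value):
    """Return the page size in KiB, or a symbolic value unchanged."""
    value = value.strip().lower()
    if value in ("small", "large", "any"):
        return value                  # symbolic sizes pass through
    if value.isdigit():
        return int(value)             # bare numbers are already KiB
    for unit, factor in UNIT_KIB.items():
        if value.endswith(unit):
            return int(value[:-len(unit)]) * factor
    raise ValueError("unrecognized page size: %r" % value)
```

  Under this reading, `2Mb`, `2MB`, and `2048` all denote the same
  2 MiB page size, which is what the reporter expected.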

  However, whenever I use the former notation (2Mb), conductor fails
  with the misleading NUMA error below... whereas with the latter
  notation (2048), allocation succeeds and the resulting instance is
  backed with 2MB hugepages on an x86_64 platform (as verified by
  running `grep HugePages_Free /proc/meminfo` before and after stopping
  the created instance).
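  The verification step can be scripted; a small sketch (hypothetical
  helper, assuming the standard `HugePages_*` counters exposed by the
  Linux kernel in `/proc/meminfo`):

```python
# Sketch: extract the hugepage counters from /proc/meminfo text, the
# counters compared above to verify that an instance is hugepage-backed.
def hugepage_counters(meminfo_text):
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_"):
            counters[key] = int(rest.split()[0])
    return counters

# Typical usage on a compute node:
# with open("/proc/meminfo") as f:
#     print(hugepage_counters(f.read()))
```

  A drop in `HugePages_Free` after boot (and a matching rise after the
  instance stops) confirms hugepage backing.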

  ERROR nova.scheduler.utils [req-de6920d5-829b-411c-acd7-1343f48824c9
  cb2abbb91da54209a5ad93a845b4cc26 cb226ff7932d40b0a48ec129e162a2fb -
  default default] [instance: 5b53d1d4-6a16-4db9-ab52-b267551c6528]
  Error from last host: node1 (node FQDN-REDACTED): ['Traceback (most
  recent call last):\n', '  File "/usr/lib/python3/dist-
  packages/nova/compute/manager.py", line 2106, in
  _build_and_run_instance\n    with rt.instance_claim(context, instance,
  node, limits):\n', '  File "/usr/lib/python3/dist-
  packages/oslo_concurrency/lockutils.py", line 274, in inner\n
  return f(*args, **kwargs)\n', '  File "/usr/lib/python3/dist-
  packages/nova/compute/resource_tracker.py", line 217, in
  instance_claim\n    pci_requests, overhead=overhead,
  limits=limits)\n', '  File "/usr/lib/python3/dist-
  packages/nova/compute/claims.py", line 95, in __init__\n
  self._claim_test(resources, limits)\n', '  File "/usr/lib/python3
  /dist-packages/nova/compute/claims.py", line 162, in _claim_test\n
  "; ".join(reasons))\n', 'nova.exception.ComputeResourcesUnavailable:
  Insufficient compute resources: Requested instance NUMA topology
  cannot fit the given host NUMA topology.\n', '\nDuring handling of the
  above exception, another exception occurred:\n\n', 'Traceback (most
  recent call last):\n', '  File "/usr/lib/python3/dist-
  packages/nova/compute/manager.py", line 1940, in
  _do_build_and_run_instance\n    filter_properties, request_spec)\n', '
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line
  2156, in _build_and_run_instance\n    instance_uuid=instance.uuid,
  reason=e.format_message())\n', 'nova.exception.RescheduledException:
  Build of instance 5b53d1d4-6a16-4db9-ab52-b267551c6528 was re-
  scheduled: Insufficient compute resources: Requested instance NUMA
  topology cannot fit the given host NUMA topology.\n']

  Additional info:
  I am using Debian testing (buster) and all OpenStack packages included therein.

  $ dpkg -l | grep nova
  ii  nova-common                           2:18.1.0-2                              all          OpenStack Compute - common files
  ii  nova-compute                          2:18.1.0-2                              all          OpenStack Compute - compute node
  ii  nova-compute-kvm                      2:18.1.0-2                              all          OpenStack Compute - compute node (KVM)
  ii  python3-nova                          2:18.1.0-2                              all          OpenStack Compute - libraries
  ii  python3-novaclient                    2:11.0.0-2                              all          client library for OpenStack Compute API - 3.x

  $ dpkg -l | grep qemu
  ii  ipxe-qemu                             1.0.0+git-20161027.b991c67-1            all          PXE boot firmware - ROM images for qemu
  ii  qemu-block-extra:amd64                1:3.1+dfsg-2+b1                         amd64        extra block backend modules for qemu-system and qemu-utils
  ii  qemu-kvm                              1:3.1+dfsg-2+b1                         amd64        QEMU Full virtualization on x86 hardware
  ii  qemu-system-common                    1:3.1+dfsg-2+b1                         amd64        QEMU full system emulation binaries (common files)
  ii  qemu-system-data                      1:3.1+dfsg-2                            all          QEMU full system emulation (data files)
  ii  qemu-system-gui                       1:3.1+dfsg-2+b1                         amd64        QEMU full system emulation binaries (user interface and audio support)
  ii  qemu-system-x86                       1:3.1+dfsg-2+b1                         amd64        QEMU full system emulation binaries (x86)
  ii  qemu-utils                            1:3.1+dfsg-2+b1                         amd64        QEMU utilities

  * I forced nova to allocate on the same hypervisor (node1) when
  checking for the issue, and I can repeatedly allocate using a flavor
  that specifies hugepages with hw:mem_page_size=2048 -- by contrast,
  with a flavor that is otherwise identical except for the 2048/2Mb
  difference, allocation repeatedly fails.

  * I am using libvirt+kvm.  I don't think it matters, but I am using
  Ceph as a storage backend and neutron in a very basic VLAN-based
  segmentation configuration (no OVS or anything remotely fancy).

  * I specified hw:numa_nodes='1' when creating the flavor... and all my
  hypervisors only have 1 NUMA node, so allocation should always succeed
  as long as there are free huge pages (which there are).
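  The claim failure in the traceback above reduces to a fit test; a toy
  model (not Nova's actual claims code) of the page-count check that
  should succeed under the conditions described:

```python
# Toy model of the NUMA hugepage fit test behind the
# "Requested instance NUMA topology cannot fit" error: a request fits a
# node only if enough free pages of the exact requested size exist.
def fits(node_free_pages, page_size_kib, mem_mib):
    """node_free_pages maps page size (KiB) -> free page count."""
    needed = (mem_mib * 1024) // page_size_kib
    return node_free_pages.get(page_size_kib, 0) >= needed
```

  With one NUMA node and free 2 MiB pages available, this check passes,
  which is why the failure for the 2Mb spelling pointed to a parsing
  problem rather than a genuine resource shortage.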

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1816454/+subscriptions

