
yahoo-eng-team team mailing list archive

[Bug 2015892] [NEW] Unused pre-existing mediated devices not "available" on instance creation

 

Public bug reported:

Description
===========

We are running Yoga, deployed via Kolla-Ansible, on a number of compute
hosts equipped with NVIDIA A100s. Since we recently deployed new images
to pull in this fix
(https://review.opendev.org/c/openstack/nova/+/866154), nova refuses to
consider pre-existing mediated devices on instance creation, even though
they are not in use by any instance. This happens both if we pre-create
the mdevs (in an attempt to make their UUIDs persistent) and if we
leave the creation of mediated devices to nova. In the latter case,
instance (and mdev) creation succeeds, but the mdevs are not cleaned up
when instances are destroyed. After a while the hardware fills up with
unused mdevs left behind by past instances, and new instances fail to be
created.
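
To check which mdevs on a host are actually unused, one possible diagnostic is sketched below. It is not part of nova. It assumes that existing mdevs appear as entries under /sys/bus/mdev/devices and that each libvirt domain XML (for example from "virsh dumpxml") references its mdevs through <hostdev type='mdev'> elements with a <source><address uuid='...'/> child. The function names are made up for illustration.

```python
import os
import xml.etree.ElementTree as ET


def list_host_mdevs(mdev_devices_dir="/sys/bus/mdev/devices"):
    """Return the UUIDs of all mediated devices that exist on the host."""
    if not os.path.isdir(mdev_devices_dir):
        return set()
    return set(os.listdir(mdev_devices_dir))


def mdevs_used_by_domain(domain_xml):
    """Return the mdev UUIDs referenced by one libvirt domain XML string."""
    root = ET.fromstring(domain_xml)
    used = set()
    for hostdev in root.findall("./devices/hostdev"):
        if hostdev.get("type") != "mdev":
            continue  # skip PCI/USB passthrough devices
        address = hostdev.find("./source/address")
        if address is not None and address.get("uuid"):
            used.add(address.get("uuid"))
    return used


def unused_mdevs(host_mdevs, domain_xmls):
    """Return the mdevs that exist on the host but are attached to no domain."""
    in_use = set()
    for xml_text in domain_xmls:
        in_use |= mdevs_used_by_domain(xml_text)
    return set(host_mdevs) - in_use
```

Feeding unused_mdevs() the result of list_host_mdevs() and the XML of every defined domain would list the leftover devices described above.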

Steps to reproduce
==================

1. Configure nova according to the "Attaching virtual GPU devices to guests" guide (https://docs.openstack.org/nova/latest/admin/virtual-gpu.html)
2. Pre-create mediated devices until all available instances are used up
3. Create an instance with a vGPU attached
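
Step 2 can be sketched roughly as follows, assuming the standard Linux mdev sysfs layout (/sys/class/mdev_bus/<pci address>/mdev_supported_types/<type>/ with the "create" and "available_instances" files). The helper names are made up, and the mdev type passed in (such as "nvidia-474") is a placeholder that depends on the GPU and driver. On a real host this needs root.

```python
import os
import uuid

SYSFS_MDEV_BUS = "/sys/class/mdev_bus"


def nodedev_to_pci_address(nodedev_name):
    """Convert a libvirt node device name such as 'pci_0000_01_02_1'
    to the PCI address form used in sysfs, '0000:01:02.1'."""
    prefix, domain, bus, slot, function = nodedev_name.split("_")
    if prefix != "pci":
        raise ValueError("not a PCI node device: %s" % nodedev_name)
    return "%s:%s:%s.%s" % (domain, bus, slot, function)


def type_dir(pci_address, mdev_type, sysfs_root=SYSFS_MDEV_BUS):
    """Return the sysfs directory of one mdev type on one parent device."""
    return os.path.join(
        sysfs_root, pci_address, "mdev_supported_types", mdev_type)


def available_instances(pci_address, mdev_type, sysfs_root=SYSFS_MDEV_BUS):
    """Return how many more mdevs of this type the parent device can hold."""
    path = os.path.join(
        type_dir(pci_address, mdev_type, sysfs_root), "available_instances")
    with open(path) as f:
        return int(f.read().strip())


def create_mdev(pci_address, mdev_type, mdev_uuid=None,
                sysfs_root=SYSFS_MDEV_BUS):
    """Create one mdev by writing its UUID to the type's 'create' file."""
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    path = os.path.join(type_dir(pci_address, mdev_type, sysfs_root), "create")
    with open(path, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

Passing an explicit mdev_uuid is what would keep the UUIDs stable across re-creation. The sysfs_root parameter exists only so the helpers can be tried against a scratch directory.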

Expected result
===============

Instance is created with vGPU attached.

Actual result
=============

Instance creation fails with "Error: Exceeded maximum number of retries.
Exhausted all hosts available for retrying build failures for instance".
Nova logs show:

2023-04-11 17:37:40.156 7 INFO nova.virt.libvirt.driver [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] Available mdevs at: set().
2023-04-11 17:37:40.253 7 INFO nova.virt.libvirt.driver [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_01_02_1', 'pci_0000_a1_01_4', 'pci_0000_81_01_2', 'pci_0000_a1_01_3', 'pci_0000_81_02_5', 'pci_0000_81_02_2', 'pci_0000_01_01_2', 'pci_0000_81_01_5', 'pci_0000_81_00_4', 'pci_0000_01_02_0', 'pci_0000_a1_00_5', 'pci_0000_a1_01_0', 'pci_0000_01_02_4', 'pci_0000_81_02_4', 'pci_0000_a1_02_4', 'pci_0000_81_01_4', 'pci_0000_81_02_0', 'pci_0000_01_02_3', 'pci_0000_01_01_6', 'pci_0000_81_02_3', 'pci_0000_a1_00_4', 'pci_0000_a1_01_5', 'pci_0000_81_01_6', 'pci_0000_a1_02_1', 'pci_0000_01_01_3', 'pci_0000_a1_02_0', 'pci_0000_a1_02_3', 'pci_0000_01_02_7', 'pci_0000_81_02_7', 'pci_0000_81_00_7', 'pci_0000_81_00_6', 'pci_0000_01_02_6', 'pci_0000_a1_01_6', 'pci_0000_a1_02_5', 'pci_0000_a1_01_2', 'pci_0000_81_01_1', 'pci_0000_a1_01_7', 'pci_0000_a1_02_7', 'pci_0000_01_00_6', 'pci_0000_81_00_5', 'pci_0000_01_00_5', 'pci_0000_01_01_1', 'pci_0000_01_01_7', 'pci_0000_81_01_0', 'pci_0000_01_02_5', 'pci_0000_81_02_6', 'pci_0000_a1_00_7', 'pci_0000_81_01_3', 'pci_0000_a1_02_6', 'pci_0000_01_00_7', 'pci_0000_01_01_4', 'pci_0000_01_02_2', 'pci_0000_a1_02_2', 'pci_0000_01_01_0', 'pci_0000_01_00_4', 'pci_0000_a1_00_6', 'pci_0000_a1_01_1', 'pci_0000_01_01_5', 'pci_0000_81_01_7', 'pci_0000_81_02_1'].
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] Traceback (most recent call last):
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2748, in _build_resources
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]     yield resources
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2512, in _build_and_run_instance
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]     accel_info=accel_info)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 4315, in spawn
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]     mdevs = self._allocate_mdevs(allocations)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]     return f(*args, **kwargs)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8300, in _allocate_mdevs
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34]     reason='mdev-capable resource is not available')
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
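
The device list in the "No free space found" message is long. A small helper like the one below can pull it out of a log line and summarise it per PCI bus. This is only a reading aid. It assumes the message format shown above and the pci_<domain>_<bus>_<slot>_<function> naming of the devices. The function names are made up.

```python
import ast
import re
from collections import Counter

# Matches the Python list literal that follows the message text.
_DEVICES_RE = re.compile(
    r"No free space found among the following devices: (\[.*?\])")


def devices_without_free_space(log_line):
    """Return the device names listed in a 'No free space found' log line."""
    match = _DEVICES_RE.search(log_line)
    if match is None:
        return []
    return ast.literal_eval(match.group(1))


def count_by_bus(devices):
    """Count devices per PCI bus, e.g. 'pci_0000_81_02_5' is on bus '81'."""
    return Counter(name.split("_")[2] for name in devices)
```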

Environment
===========

1. OpenStack version

Not sure how to determine this from the Kolla source images; nova
appears to be version 25.1.1. The top entries in the ChangeLog are:

CHANGES
=======

* Handle mdev devices in libvirt 7.7+
* Reproducer for bug 1951656
* ignore deleted server groups in validation
* add repoducer test for bug 1890244
* Add a workaround to skip hypervisor version check on LM

25.1.0
------

* [stable-only][cve] Check VMDK create-type against an allowed list
* Improving logging at '_allocate_mdevs'
* Gracefully ERROR in _init_instance if vnic_type changed
* Reproduce bug 1981813 in func env
* Adapt websocketproxy tests for SimpleHTTPServer fix
* enable blocked VDPA move operations
* Add compute restart capability for libvirt func tests

2. Hypervisor

libvirt/KVM

3. Storage

NFS

4. Networking

Neutron with OVN

Logs & Configs
==============

I will attach a nova log in DEBUG mode of a failed instance creation.
Additional data can be provided if necessary.

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "instance_creation.log"
   https://bugs.launchpad.net/bugs/2015892/+attachment/5663037/+files/instance_creation.log

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2015892

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2015892/+subscriptions