yahoo-eng-team team mailing list archive
Message #91690
[Bug 2015892] [NEW] Unused pre-existing mediated devices not "available" on instance creation
Public bug reported:
Description
===========
We are running Yoga, deployed via Kolla-Ansible, on a number of compute
hosts equipped with NVIDIA A100s. Since deploying images recently in
order to pull in this fix
(https://review.opendev.org/c/openstack/nova/+/866154), nova refuses to
consider pre-existing mediated devices on instance creation, even though
they are not in use by any instance. This happens both if we pre-create
the mdevs (in an attempt to make their UUIDs persistent) and also if we
leave the creation of mediated devices to nova. In the latter case,
instance (and mdev) creation succeeds, but the mdevs are not cleaned up
when instances are destroyed. Over time the hardware fills up with
unused mdevs left behind by past instances, and new instances fail to be
created.
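To confirm that leftover mdevs really are unused, one option is to compare the mdev UUIDs present on the host with the UUIDs referenced by running libvirt domains. The following is only a sketch: it assumes the domain XML has been collected beforehand (for example with `virsh dumpxml`) and that the host UUIDs were listed from /sys/bus/mdev/devices. The function names are hypothetical and are not part of nova.

```python
import xml.etree.ElementTree as ET


def mdev_uuids_in_domain(domain_xml):
    """Return the set of mdev UUIDs referenced by one libvirt domain XML."""
    root = ET.fromstring(domain_xml)
    uuids = set()
    # libvirt describes an attached mediated device as:
    # <hostdev mode='subsystem' type='mdev'>
    #   <source><address uuid='...'/></source>
    # </hostdev>
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "mdev":
            continue
        for address in hostdev.findall("./source/address"):
            uuid = address.get("uuid")
            if uuid:
                uuids.add(uuid.lower())
    return uuids


def unused_mdevs(host_mdev_uuids, domain_xmls):
    """Return host mdev UUIDs that no given domain references."""
    in_use = set()
    for xml_text in domain_xmls:
        in_use |= mdev_uuids_in_domain(xml_text)
    return {u.lower() for u in host_mdev_uuids} - in_use
```

Any UUID returned by `unused_mdevs` exists on the host but is not attached to a guest, which matches the situation described above.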
Steps to reproduce
==================
1. Configure nova according to the "Attaching virtual GPU devices to guests" guide (https://docs.openstack.org/nova/latest/admin/virtual-gpu.html)
2. Pre-create mediated devices until all available instances on the GPUs are used up
3. Create an instance with a vGPU attached
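For reference, step 2 can be sketched with the kernel's mdev sysfs interface, where writing a UUID to a type's `create` file creates a mediated device and `available_instances` reports the remaining capacity. This is a minimal sketch, not the exact procedure we used: the helper names are hypothetical, the mdev type name must be replaced by the one configured in nova, and writing to sysfs requires root.

```python
import uuid
from pathlib import Path


def available_instances(parent, mdev_type, root="/sys/class/mdev_bus"):
    """Read how many more mdevs of mdev_type the parent device can host."""
    path = (Path(root) / parent / "mdev_supported_types"
            / mdev_type / "available_instances")
    return int(path.read_text().strip())


def create_mdev(parent, mdev_type, mdev_uuid=None, root="/sys/class/mdev_bus"):
    """Create one mdev on parent (a PCI address such as '0000:01:00.4').

    Returns the UUID written to the sysfs 'create' file.
    """
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    if available_instances(parent, mdev_type, root) < 1:
        raise RuntimeError(f"no free {mdev_type} instance on {parent}")
    type_dir = Path(root) / parent / "mdev_supported_types" / mdev_type
    (type_dir / "create").write_text(mdev_uuid)
    return mdev_uuid
```

Passing a fixed `mdev_uuid` is what one would do when trying to keep UUIDs stable across reboots. The `root` parameter exists only so the helpers can be pointed at a test directory.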
Expected result
===============
Instance is created with vGPU attached.
Actual result
=============
Instance creation fails with "Error: Exceeded maximum number of retries.
Exhausted all hosts available for retrying build failures for instance".
Nova logs show:
2023-04-11 17:37:40.156 7 INFO nova.virt.libvirt.driver [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] Available mdevs at: set().
2023-04-11 17:37:40.253 7 INFO nova.virt.libvirt.driver [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_01_02_1', 'pci_0000_a1_01_4', 'pci_0000_81_01_2', 'pci_0000_a1_01_3', 'pci_0000_81_02_5', 'pci_0000_81_02_2', 'pci_0000_01_01_2', 'pci_0000_81_01_5', 'pci_0000_81_00_4', 'pci_0000_01_02_0', 'pci_0000_a1_00_5', 'pci_0000_a1_01_0', 'pci_0000_01_02_4', 'pci_0000_81_02_4', 'pci_0000_a1_02_4', 'pci_0000_81_01_4', 'pci_0000_81_02_0', 'pci_0000_01_02_3', 'pci_0000_01_01_6', 'pci_0000_81_02_3', 'pci_0000_a1_00_4', 'pci_0000_a1_01_5', 'pci_0000_81_01_6', 'pci_0000_a1_02_1', 'pci_0000_01_01_3', 'pci_0000_a1_02_0', 'pci_0000_a1_02_3', 'pci_0000_01_02_7', 'pci_0000_81_02_7', 'pci_0000_81_00_7', 'pci_0000_81_00_6', 'pci_0000_01_02_6', 'pci_0000_a1_01_6', 'pci_0000_a1_02_5', 'pci_0000_a1_01_2', 'pci_0000_81_01_1', 'pci_0000_a1_01_7', 'pci_0000_a1_02_7', 'pci_0000_01_00_6', 'pci_0000_81_00_5', 'pci_0000_01_00_5', 'pci_0000_01_01_1', 'pci_0000_01_01_7', 'pci_0000_81_01_0', 'pci_0000_01_02_5', 'pci_0000_81_02_6', 'pci_0000_a1_00_7', 'pci_0000_81_01_3', 'pci_0000_a1_02_6', 'pci_0000_01_00_7', 'pci_0000_01_01_4', 'pci_0000_01_02_2', 'pci_0000_a1_02_2', 'pci_0000_01_01_0', 'pci_0000_01_00_4', 'pci_0000_a1_00_6', 'pci_0000_a1_01_1', 'pci_0000_01_01_5', 'pci_0000_81_01_7', 'pci_0000_81_02_1'].
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] Traceback (most recent call last):
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2748, in _build_resources
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] yield resources
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2512, in _build_and_run_instance
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] accel_info=accel_info)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 4315, in spawn
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] mdevs = self._allocate_mdevs(allocations)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] return f(*args, **kwargs)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8300, in _allocate_mdevs
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] reason='mdev-capable resource is not available')
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
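The device names in the "No free space found" message are libvirt node device names. To cross-check them against sysfs, they can be converted to PCI addresses. The helpers below are a small sketch with hypothetical names; they assume the log message format shown above and the `pci_<domain>_<bus>_<slot>_<function>` naming seen in it.

```python
import ast

MARKER = "No free space found among the following devices: "


def devices_from_log_line(line):
    """Extract the Python-style device list from the nova log message."""
    start = line.index(MARKER) + len(MARKER)
    end = line.rindex("]") + 1
    return ast.literal_eval(line[start:end])


def nodedev_to_pci_address(nodedev_name):
    """Convert e.g. 'pci_0000_01_02_1' to '0000:01:02.1'."""
    prefix, domain, bus, slot, function = nodedev_name.split("_")
    if prefix != "pci":
        raise ValueError(f"not a PCI node device name: {nodedev_name}")
    return f"{domain}:{bus}:{slot}.{function}"
```

Each resulting address can then be looked up under /sys/class/mdev_bus/ to inspect `available_instances` for the configured type.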
Environment
===========
1. OpenStack version
Not sure how to tell from the kolla source images. nova seems to be
version 25.1.1. The top entries in the ChangeLog are:
CHANGES
=======
* Handle mdev devices in libvirt 7.7+
* Reproducer for bug 1951656
* ignore deleted server groups in validation
* add repoducer test for bug 1890244
* Add a workaround to skip hypervisor version check on LM
25.1.0
------
* [stable-only][cve] Check VMDK create-type against an allowed list
* Improving logging at '_allocate_mdevs'
* Gracefully ERROR in _init_instance if vnic_type changed
* Reproduce bug 1981813 in func env
* Adapt websocketproxy tests for SimpleHTTPServer fix
* enable blocked VDPA move operations
* Add compute restart capability for libvirt func tests
2. Hypervisor
libvirt/KVM
3. Storage
NFS
4. Networking
Neutron with OVN
Logs & Configs
==============
I will attach a nova log in DEBUG mode of a failed instance creation.
Additional data can be provided if necessary.
** Affects: nova
Importance: Undecided
Status: New
** Attachment added: "instance_creation.log"
https://bugs.launchpad.net/bugs/2015892/+attachment/5663037/+files/instance_creation.log
https://bugs.launchpad.net/bugs/2015892