yahoo-eng-team team mailing list archive
Message #85707
[Bug 1922264] [NEW] On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group
Public bug reported:
Description
===========
We have multiple compute nodes with multiple NVIDIA GPU cards (RTX8000/RTX6000).
Nodes with a mix of RTX8000 and RTX6000 cards have 2 vGPU groups configured in nova.conf, but nova-compute only creates resource providers for the first group.
Steps to reproduce
==================
For example, on a node with 2 RTX8000 and 1 RTX6000:
$ lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
81:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
e2:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
$ nvidia-smi
Thu Apr 1 17:22:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.04    Driver Version: 460.32.04    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:21:00.0 Off |                    0 |
| N/A   30C    P8    27W / 250W |    285MiB / 46079MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:81:00.0 Off |                    0 |
| N/A   30C    P8    27W / 250W |    285MiB / 46079MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:E2:00.0 Off |                    0 |
| N/A   30C    P8    24W / 250W |    150MiB / 23039MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Extract from nova.conf:
...
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387
[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0
[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
When nova-compute starts, the log shows:
2021-04-01 17:15:25.454 7 WARNING nova.virt.libvirt.driver [req-bebc8637-d231-435c-a6cc-4613e14e2f76 - - - - -] The vGPU type 'nvidia-428' was listed in '[devices] enabled_vgpu_types' but no corresponding '[vgpu_nvidia-428]' group or '[vgpu_nvidia-428] device_addresses' option was defined. Only the first type 'nvidia-428' will be used.
A listing of the resource providers on this node shows that only the nvidia-428 GPUs were used:
$ openstack resource provider list --os-placement-api-version 1.14 --in-tree f5d35bdc-b4b7-4764-a9d0-41f67fd95385
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
| uuid                                 | name                               | generation | root_provider_uuid                   | parent_provider_uuid                 |
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
| f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | cloud-lyse-cmp-02                  |         32 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | None                                 |
| 21a4a16e-8d33-4a23-a924-b00f8c31f0d0 | cloud-lyse-cmp-02_pci_0000_81_00_0 |          4 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
| 76e1ee94-fbf2-410e-9711-fba71c709388 | cloud-lyse-cmp-02_pci_0000_21_00_0 |          2 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
In nova.conf, if I swap nvidia-428 and nvidia-387 in enabled_vgpu_types, only nvidia-387 is loaded.
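To rule out a plain typo in the file itself, the extract above can be cross-checked with a short script. This is only a sketch: it uses Python's standard configparser rather than oslo.config (which nova actually uses), so it checks the text of the file and does not reproduce how nova-compute registers the groups. The function name check_vgpu_groups and the NOVA_CONF_EXTRACT constant are made up for this example.

```python
import configparser

# The nova.conf extract from this report, copied verbatim.
NOVA_CONF_EXTRACT = """
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
"""


def check_vgpu_groups(conf_text):
    """Map each enabled vGPU type to the device addresses of its group.

    Hypothetical helper, not part of nova. A type whose [vgpu_<type>]
    group or device_addresses option is missing maps to an empty list.
    """
    # strict=False tolerates duplicate sections/options; interpolation=None
    # keeps any '%' characters in a real nova.conf from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    raw_types = parser.get("devices", "enabled_vgpu_types", fallback="")
    enabled = [t.strip() for t in raw_types.split(",") if t.strip()]

    result = {}
    for vgpu_type in enabled:
        section = "vgpu_" + vgpu_type
        raw_addrs = parser.get(section, "device_addresses", fallback="")
        result[vgpu_type] = [a.strip() for a in raw_addrs.split(",") if a.strip()]
    return result


if __name__ == "__main__":
    for vgpu_type, addresses in check_vgpu_groups(NOVA_CONF_EXTRACT).items():
        status = ", ".join(addresses) if addresses else "MISSING group or device_addresses"
        print(f"{vgpu_type}: {status}")
```

With the extract above, both types resolve to their groups, so as far as this check can tell the file is consistent with the documented layout.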
Expected result
===============
All vGPU groups should be loaded (as stated in the docs).
Actual result
=============
Only the first vGPU group is loaded.
Environment
===========
OpenStack Victoria was deployed with kolla-ansible.
NVIDIA GRID KVM drivers: 12.1 (latest)
System: Ubuntu 20.04.2
nova-compute version: 22.2.1
Hypervisor: libvirt+KVM (libvirt 6.0.0, QEMU/KVM 4.2.1)
Storage: Dell EMC Storage Center (7.3.20.19)
Network: neutron with OVN/OVS
** Affects: nova
Importance: Undecided
Status: New
** Summary changed:
- on a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config
+ On a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config
** Summary changed:
- On a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config
+ On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1922264
Title:
On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load
second group
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1922264/+subscriptions