yahoo-eng-team team mailing list archive

[Bug 1922264] [NEW] On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group

 

Public bug reported:

Description
===========
We have multiple compute nodes, each with several NVIDIA GPU cards (RTX8000/RTX6000).
Nodes with a mix of RTX8000 and RTX6000 cards have 2 GPU groups configured in nova.conf, but nova-compute only creates resource providers for the first GPU group.

Steps to reproduce
==================

For example, take a node with 2 RTX8000 cards and 1 RTX6000 card:

$ lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
81:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
e2:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

$ nvidia-smi
Thu Apr  1 17:22:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.04    Driver Version: 460.32.04    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:21:00.0 Off |                    0 |
| N/A   30C    P8    27W / 250W |    285MiB / 46079MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:81:00.0 Off |                    0 |
| N/A   30C    P8    27W / 250W |    285MiB / 46079MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:E2:00.0 Off |                    0 |
| N/A   30C    P8    24W / 250W |    150MiB / 23039MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
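
Note that nvidia-smi prints the bus ID with an 8-digit domain and upper-case hex (00000000:E2:00.0), lspci omits the domain (e2:00.0), and nova.conf uses the 4-digit lower-case form (0000:e2:00.0). A minimal sketch of a helper for comparing the three notations follows. The function name normalize_pci_address is hypothetical and is not part of nova or of any NVIDIA tool.

```python
import re

# Optional domain (4 to 8 hex digits), then bus:slot.function.
_PCI_RE = re.compile(
    r"(?:([0-9A-Fa-f]{4,8}):)?([0-9A-Fa-f]{2}):([0-9A-Fa-f]{2})\.([0-7])"
)


def normalize_pci_address(address):
    """Return the address as lower-case dddd:bb:ss.f (hypothetical helper)."""
    match = _PCI_RE.fullmatch(address.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {address!r}")
    domain, bus, slot, function = match.groups()
    # A missing domain (lspci default output) is treated as domain 0.
    domain_value = int(domain, 16) if domain else 0
    return f"{domain_value:04x}:{bus.lower()}:{slot.lower()}.{function}"


# Bus IDs as printed by nvidia-smi in this report.
for bus_id in ("00000000:21:00.0", "00000000:81:00.0", "00000000:E2:00.0"):
    print(bus_id, "->", normalize_pci_address(bus_id))
```

With this normalization, the three bus IDs reported by nvidia-smi correspond to the three addresses used in the nova.conf groups quoted in this report.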

Extract from nova.conf:
...
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
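
To rule out a simple typo in the file, the excerpt can be checked independently of nova. The sketch below uses Python's standard configparser, not nova's own oslo.config loading, so it only verifies that every type in enabled_vgpu_types has a matching [vgpu_<type>] section with device_addresses and that no PCI address is assigned to two types. The names split_list and check_vgpu_groups are hypothetical.

```python
import configparser

# Excerpt copied from this report; a real nova.conf has many more sections.
NOVA_CONF_EXCERPT = """\
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
"""


def split_list(value):
    """Split a comma-separated option value and strip whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_vgpu_groups(conf_text):
    """Return ({vgpu_type: [pci_address, ...]}, [problem, ...])."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    mapping, problems, owner = {}, [], {}
    enabled = split_list(
        parser.get("devices", "enabled_vgpu_types", fallback="")
    )
    for vgpu_type in enabled:
        section = f"vgpu_{vgpu_type}"
        # fallback also covers the case where the whole section is absent.
        addresses = split_list(
            parser.get(section, "device_addresses", fallback="")
        )
        if not addresses:
            problems.append(
                f"[{section}] is missing or has no device_addresses"
            )
        for address in addresses:
            if address in owner:
                problems.append(
                    f"{address} is listed for both "
                    f"{owner[address]} and {vgpu_type}"
                )
            owner[address] = vgpu_type
        mapping[vgpu_type] = addresses
    return mapping, problems


mapping, problems = check_vgpu_groups(NOVA_CONF_EXCERPT)
print(mapping)
print(problems or "no problems found")
```

For the excerpt as quoted, this check finds both groups and reports no problems, which is consistent with the expectation that both groups should be loaded.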


When nova-compute starts, the log shows:
2021-04-01 17:15:25.454 7 WARNING nova.virt.libvirt.driver [req-bebc8637-d231-435c-a6cc-4613e14e2f76 - - - - -] The vGPU type 'nvidia-428' was listed in '[devices] enabled_vgpu_types' but no corresponding '[vgpu_nvidia-428]' group or '[vgpu_nvidia-428] device_addresses' option was defined. Only the first type 'nvidia-428' will be used.

A listing of the resource providers on this node shows that only the nvidia-428 GPUs were used:
$ openstack resource provider list --os-placement-api-version 1.14 --in-tree f5d35bdc-b4b7-4764-a9d0-41f67fd95385
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
| uuid                                 | name                               | generation | root_provider_uuid                   | parent_provider_uuid                 |
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
| f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | cloud-lyse-cmp-02                  |         32 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | None                                 |
| 21a4a16e-8d33-4a23-a924-b00f8c31f0d0 | cloud-lyse-cmp-02_pci_0000_81_00_0 |          4 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
| 76e1ee94-fbf2-410e-9711-fba71c709388 | cloud-lyse-cmp-02_pci_0000_21_00_0 |          2 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
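
The missing provider can be identified mechanically. The sketch below assumes that child provider names follow the <hostname>_pci_<address with ':' and '.' replaced by '_'> pattern visible in the listing above. That pattern is inferred from this output only, and expected_provider_name is a hypothetical helper.

```python
HOSTNAME = "cloud-lyse-cmp-02"

# PCI addresses configured across both vGPU groups in nova.conf.
CONFIGURED_ADDRESSES = ["0000:21:00.0", "0000:81:00.0", "0000:e2:00.0"]

# Child provider names actually returned by the listing above.
LISTED_PROVIDERS = {
    "cloud-lyse-cmp-02_pci_0000_81_00_0",
    "cloud-lyse-cmp-02_pci_0000_21_00_0",
}


def expected_provider_name(hostname, pci_address):
    """Build the provider name assumed for a PCI address (inferred pattern)."""
    suffix = pci_address.replace(":", "_").replace(".", "_")
    return f"{hostname}_pci_{suffix}"


expected = [
    expected_provider_name(HOSTNAME, address)
    for address in CONFIGURED_ADDRESSES
]
missing = [name for name in expected if name not in LISTED_PROVIDERS]
print("missing providers:", missing)
```

Under that naming assumption, the only provider absent from the listing is the one for 0000:e2:00.0, the card assigned to the nvidia-387 group.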

If I swap nvidia-428 and nvidia-387 in enabled_vgpu_types in nova.conf,
only nvidia-387 is loaded.


Expected result
===============
All GPU groups should be loaded (as stated in the docs).

Actual result
=============
Only the first GPU group listed in enabled_vgpu_types is loaded.

Environment
===========
OpenStack Victoria was deployed with kolla-ansible.
NVIDIA GRID KVM drivers: 12.1 (latest)
System: Ubuntu 20.04.2
nova-compute version: 22.2.1

Hypervisor: libvirt+KVM (libvirt 6.0.0, QEMU/KVM 4.2.1)
Storage: Dell EMC Storage Center (7.3.20.19)
Network: neutron with OVN/OVS

** Affects: nova
     Importance: Undecided
         Status: New

** Summary changed:

- on a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config
+ On a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config

** Summary changed:

- On a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config
+ On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1922264

Title:
  On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load
  second group

Status in OpenStack Compute (nova):
  New


