yahoo-eng-team team mailing list archive

[Bug 2008883] Re: VGPU - I can only use one MIG profile in Nova

 

I'm torn between treating this as a wishlist bug and a feature request.

I think this may be related to the resource provider mappings.

With this configuration:
[devices]
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
[vgpu_nvidia-474]
device_addresses = 0000:61:00.4,0000:61:01.0
[vgpu_nvidia-475]
device_addresses = 0000:61:01.7
[vgpu_nvidia-476]
device_addresses = 0000:61:00.6

I would expect 4 resource providers to be created, each with an
inventory of 1 VGPU.
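
A rough sketch of how that count falls out of the configuration above.
It uses Python's configparser as a stand-in for nova's own option
parsing (nova really uses oslo.config), and the helper name
expected_providers is made up for illustration:

```python
import configparser

# Configuration copied from the bug report.
NOVA_CONF = """
[devices]
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
[vgpu_nvidia-474]
device_addresses = 0000:61:00.4,0000:61:01.0
[vgpu_nvidia-475]
device_addresses = 0000:61:01.7
[vgpu_nvidia-476]
device_addresses = 0000:61:00.6
"""


def expected_providers(conf_text):
    """Map each configured PCI address to the vGPU type assigned to it."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    enabled = parser["devices"]["enabled_vgpu_types"].split(",")
    mapping = {}
    for vgpu_type in (t.strip() for t in enabled):
        # Each enabled type has its own [vgpu_<type>] group.
        addresses = parser[f"vgpu_{vgpu_type}"]["device_addresses"]
        for address in addresses.split(","):
            mapping[address.strip()] = vgpu_type
    return mapping


providers = expected_providers(NOVA_CONF)
for address, vgpu_type in sorted(providers.items()):
    print(address, vgpu_type)
print("expected providers with VGPU inventory:", len(providers))
```

One entry per configured address is the expectation only; what
placement actually reports is what the inventory check is for.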

From the output below, these appear to be:

3cd4dbc7-2c2a-448d-a041-27c8fd685950
7d5abf99-3c42-4c62-ba33-15682c6cfc5b
5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0
58fbbedb-9845-4397-bd20-f559ba68daee

Can you run an inventory show on each and confirm that?
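
As a sketch of that check: the snippet below maps each configured PCI
address to the provider name seen in the resource provider list in the
report, looks up its UUID, and prints an
"openstack resource provider inventory list" command for it (from the
osc-placement plugin). The helper names are hypothetical, and the
name pattern is inferred from the listing in the report:

```python
# Names and UUIDs copied from the resource provider list in the report.
PROVIDERS = {
    "gpu-a01.example.os-tests.com_pci_0000_61_01_0":
        "3cd4dbc7-2c2a-448d-a041-27c8fd685950",
    "gpu-a01.example.os-tests.com_pci_0000_61_00_4":
        "7d5abf99-3c42-4c62-ba33-15682c6cfc5b",
    "gpu-a01.example.os-tests.com_pci_0000_61_01_7":
        "5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0",
    "gpu-a01.example.os-tests.com_pci_0000_61_00_6":
        "58fbbedb-9845-4397-bd20-f559ba68daee",
}
HOST = "gpu-a01.example.os-tests.com"
# Addresses from the [vgpu_*] groups in nova.conf.
ADDRESSES = ["0000:61:00.4", "0000:61:01.0", "0000:61:01.7", "0000:61:00.6"]


def provider_name(host, pci_address):
    """Build the child provider name, e.g. <host>_pci_0000_61_01_0."""
    return f"{host}_pci_" + pci_address.replace(":", "_").replace(".", "_")


def inventory_commands(host, addresses):
    """Return one CLI command per configured address."""
    commands = []
    for address in addresses:
        uuid = PROVIDERS[provider_name(host, address)]
        commands.append(f"openstack resource provider inventory list {uuid}")
    return commands


commands = inventory_commands(HOST, ADDRESSES)
for command in commands:
    print(command)
```

Each command should show a VGPU inventory with a total of 1 if the
providers were set up as expected.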

Looking at the flavors, you appear to have added the correct trait
requests to have them target the appropriate RPs.

The approach you are taking was replaced by the generic mdev feature in Xena.
https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/generic-mdevs.html

There, instead of tagging the RP manually with a trait, you would use a
different resource class per mdev type.
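
A hedged sketch of what that could look like for this card. It renders
a Xena-style config with configparser and the matching flavor
properties. The option names (enabled_mdev_types, the [mdev_<type>]
groups and mdev_class) are my reading of the generic mdev feature and
should be checked against the docs for the release in use; the
CUSTOM_MIG_* resource class names are invented for the example. None
of this applies to Ussuri, which is what the report is running.

```python
import configparser
import io

# mdev types and addresses come from the report; the resource class
# names are hypothetical.
MDEV_TYPES = {
    "nvidia-474": {
        "addresses": ["0000:61:00.4", "0000:61:01.0"],
        "mdev_class": "CUSTOM_MIG_1_5C",
    },
    "nvidia-475": {
        "addresses": ["0000:61:01.7"],
        "mdev_class": "CUSTOM_MIG_2_10C",
    },
    "nvidia-476": {
        "addresses": ["0000:61:00.6"],
        "mdev_class": "CUSTOM_MIG_3_20C",
    },
}


def render_conf(mdev_types):
    """Render [devices] plus one [mdev_<type>] group per mdev type."""
    parser = configparser.ConfigParser()
    parser["devices"] = {"enabled_mdev_types": ",".join(mdev_types)}
    for mdev_type, spec in mdev_types.items():
        parser[f"mdev_{mdev_type}"] = {
            "device_addresses": ",".join(spec["addresses"]),
            "mdev_class": spec["mdev_class"],
        }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def flavor_property(mdev_class):
    """Flavor extra spec requesting one unit of the custom class."""
    return f"resources:{mdev_class}=1"


conf = render_conf(MDEV_TYPES)
print(conf)
for spec in MDEV_TYPES.values():
    print("--property", flavor_property(spec["mdev_class"]))
```

With one resource class per type, the flavor asks for the class
directly and no manual trait on the RP is needed.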

You are essentially trying to use this feature:

https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/vgpu-stein.html

but instead of having multiple physical GPUs, you are trying to use MIG
to partition the GPU first into VFs.

That was intended to be enabled by
https://specs.openstack.org/openstack/nova-specs/specs/ussuri/implemented/vgpu-multiple-types.html

However, when that feature was implemented, no released GPU supported MIG
or multiple mdev types on the same card.

As such, it was only ever tested with multiple mdev types on the same host,
but with one pGPU per mdev type.

** Changed in: nova
   Importance: Undecided => Wishlist

** Changed in: nova
       Status: Invalid => Incomplete

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2008883

Title:
  VGPU - I can only use one MIG profile in Nova

Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  I have one Nvidia A100 card and can only use one MIG profile.

  I divided the card into 4 MIG instances across three profiles: 2x
  A100-1-5C, 1x A100-2-10C and 1x A100-3-20C, as below.

  +-----------------------------------------------------------------------------+
  | MIG devices:                                                                |
  +------------------+----------------------+-----------+-----------------------+
  | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
  |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
  |                  |                      |        ECC|                       |
  |==================+======================+===========+=======================|
  |  0    2   0   0  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
  |                  |      0MiB / 32767MiB |           |                       |
  +------------------+----------------------+-----------+-----------------------+
  |  0    3   0   1  |      6MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
  |                  |      0MiB / 16383MiB |           |                       |
  +------------------+----------------------+-----------+-----------------------+
  |  0    9   0   2  |   4739MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
  |                  |      0MiB /  8191MiB |           |                       |
  +------------------+----------------------+-----------+-----------------------+
  |  0   10   0   3  |   4739MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
  |                  |      0MiB /  8191MiB |           |                       |
  +------------------+----------------------+-----------+-----------------------+

  
  My nova configuration:  /etc/nova/nova.conf

  [devices]
  enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
  [vgpu_nvidia-474]
  device_addresses = 0000:61:00.4,0000:61:01.0
  [vgpu_nvidia-475]
  device_addresses = 0000:61:01.7
  [vgpu_nvidia-476]
  device_addresses = 0000:61:00.6

  # openstack resource provider list
  +--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+
  | uuid                                 | name                                         | generation | root_provider_uuid                   | parent_provider_uuid                 |
  +--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+
  | a0269b89-d43d-4042-a64e-3c832f0bb23f | gpu-a01.example.os-tests.com                  |        104 | a0269b89-d43d-4042-a64e-3c832f0bb23f | None                                 |
  | f2e5a4e0-479e-4ee3-b504-36371ded49f5 | gpu-a01.example.os-tests.com_pci_0000_61_01_4 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | a513c661-6dd2-4462-b719-9fbf7b70c409 | gpu-a01.example.os-tests.com_pci_0000_61_01_2 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 9124d3a8-00fb-475e-a0f8-892ccf5d255e | gpu-a01.example.os-tests.com_pci_0000_61_00_7 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 5f443da2-3c75-45c6-9d8a-05ca8a487802 | gpu-a01.example.os-tests.com_pci_0000_61_02_1 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 20da8814-c5f0-4575-a785-579e9abdbb1d | gpu-a01.example.os-tests.com_pci_0000_61_01_3 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 37014baa-fba6-4f14-8be8-17084b3aad36 | gpu-a01.example.os-tests.com_pci_0000_61_01_1 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | a9e2f509-fb03-45a0-9ff0-7c50143c1a9c | gpu-a01.example.os-tests.com_pci_0000_61_02_2 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 202462de-7d01-45c7-b197-1f2ca5c9c7ae | gpu-a01.example.os-tests.com_pci_0000_61_02_3 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0 | gpu-a01.example.os-tests.com_pci_0000_61_01_7 |         43 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 315b5205-26ac-4ec6-b5d2-623cafc18f39 | gpu-a01.example.os-tests.com_pci_0000_61_01_6 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 208f395d-d9c7-4108-8f83-e48cbea0b637 | gpu-a01.example.os-tests.com_pci_0000_61_02_0 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 7d5abf99-3c42-4c62-ba33-15682c6cfc5b | gpu-a01.example.os-tests.com_pci_0000_61_00_4 |         18 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 315b2c35-401c-4165-add8-2b025961b9a0 | gpu-a01.example.os-tests.com_pci_0000_61_01_5 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | c43be42e-e564-46be-9025-4f00a1f7454e | gpu-a01.example.os-tests.com_pci_0000_61_00_5 |          1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 3cd4dbc7-2c2a-448d-a041-27c8fd685950 | gpu-a01.example.os-tests.com_pci_0000_61_01_0 |         14 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  | 58fbbedb-9845-4397-bd20-f559ba68daee | gpu-a01.example.os-tests.com_pci_0000_61_00_6 |         27 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
  +--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+

  Created Flavor:

  openstack --os-placement-api-version 1.6 trait create CUSTOM_N_1
  openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_1 3cd4dbc7-2c2a-448d-a041-27c8fd685950
  openstack flavor create --private  --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-1  --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_1=required

  openstack --os-placement-api-version 1.6 trait create CUSTOM_N_2
  openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_2 7d5abf99-3c42-4c62-ba33-15682c6cfc5b
  openstack flavor create --private  --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-2  --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_2=required

  openstack --os-placement-api-version 1.6 trait create CUSTOM_N_3
  openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_3 5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0
  openstack flavor create --private  --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-3  --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_3=required

  openstack --os-placement-api-version 1.6 trait create CUSTOM_N_4
  openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_4 58fbbedb-9845-4397-bd20-f559ba68daee
  openstack flavor create --private  --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-4  --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_4=required


  gpu-a01:~/nvidia-dev-ctl# ls /sys/class/mdev_bus/*/mdev_supported_types
  '/sys/class/mdev_bus/0000:61:00.4/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:00.5/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:00.6/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:00.7/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.0/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.1/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.2/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.3/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.4/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.5/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.6/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:01.7/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:02.0/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:02.1/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:02.2/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706

  '/sys/class/mdev_bus/0000:61:02.3/mdev_supported_types':
  nvidia-468  nvidia-469  nvidia-470  nvidia-471  nvidia-472  nvidia-473  nvidia-474  nvidia-475  nvidia-476  nvidia-477  nvidia-478  nvidia-706


  
  Problem:
  ===========
  I can create only two instances, both with A100-1-5C (nvidia-474). The types nvidia-475 and nvidia-476 are omitted and I can't use them.

  If I edit /etc/nova/nova.conf and replace
  enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
  with
  enabled_vgpu_types = nvidia-475,nvidia-476
  I can use only the one A100-2-10C (nvidia-475) instance.


  Packages:
  ============

  Version: Ussuri

  gpu-a01:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 20.04.4 LTS
  Release:        20.04
  Codename:       focal

  gpu-a01:~# dpkg -l | grep nova
  ii  nova-common                           2:21.2.4-0ubuntu1                                    all          OpenStack Compute - common files
  ii  nova-compute                          2:21.2.4-0ubuntu1                                    all          OpenStack Compute - compute node base
  ii  nova-compute-kvm                      2:21.2.4-0ubuntu1                                    all          OpenStack Compute - compute node (KVM)
  ii  nova-compute-libvirt                  2:21.2.4-0ubuntu1                                    all          OpenStack Compute - compute node libvirt support
  ii  python3-nova                          2:21.2.4-0ubuntu1                                    all          OpenStack Compute Python 3 libraries
  ii  python3-novaclient                    2:17.0.0-0ubuntu1                                    all          client library for OpenStack Compute API - 3.x

  gpu-a01:~# uname -a
  Linux compgpu-a01 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2008883/+subscriptions


