yahoo-eng-team team mailing list archive
Message #96046
[Bug 2114947] Re: Nova/Placement ignores the flavor’s resource_class + trait constraints when scheduling SR-IOV vGPU devices.
** Changed in: nova
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2114947
Title:
Nova/Placement ignores the flavor’s resource_class + trait constraints
when scheduling SR-IOV vGPU devices.
Status in OpenStack Compute (nova):
Invalid
Bug description:
Nova/Placement ignores the flavor’s resource_class + trait constraints
when scheduling SR-IOV vGPU devices.
Environment
Deployment : OpenStack Epoxy 2025.1 (Kolla-Ansible)
Hypervisor node : Ubuntu 24.04, NVIDIA vGPU driver 570.148
Hardware : 10xRTX 6000 Ada cards in SR-IOV mode
PCI config :
[pci]
report_in_placement = true
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:4F:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:52:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:53:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:56:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:57:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:ce:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d1:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d2:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:00.5", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:00.6", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:00.7", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:01.0", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:01.1", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d6:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d6:00.5", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "managed":"no" }
alias = { "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_48Q", "device_type":"type-VF", "name":"rtx6000-ada-48q" }
alias = { "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "device_type":"type-VF", "name":"rtx6000-ada-8q" }
alias = { "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "device_type":"type-VF", "name":"rtx6000-ada-24q" }
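As a sanity check on the configuration above, the device_spec entries can be tallied per resource class. The following is a minimal Python sketch and not part of Nova; the helper name tally_device_specs and the shortened sample section are assumptions made for illustration.

```python
import json
from collections import Counter


def tally_device_specs(conf_text):
    """Count device_spec entries per resource_class in a [pci] section."""
    counts = Counter()
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        # Only "device_spec = {...}" lines carry a JSON device description.
        if sep and key.strip() == "device_spec":
            spec = json.loads(value)
            counts[spec["resource_class"]] += 1
    return counts


# Shortened sample: three of the entries from the report.
sample = """
[pci]
report_in_placement = true
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d5:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_8Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d6:00.4", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "managed":"no" }
device_spec = { "vendor_id":"10de", "product_id":"26b1", "address":"0000:d6:00.5", "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "traits":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "managed":"no" }
"""

for resource_class, count in sorted(tally_device_specs(sample).items()):
    print(resource_class, count)
```

Applied to the full [pci] section above, the tally is 8 entries for 48Q, 6 for 8Q, and 2 for 24Q, which agrees with the Placement PCI view in the compute log.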
openstack flavor create 8xRTX-ADA-48Q --private \
    --ram 4096 --vcpus 4 --disk 0 \
    --property "resources:CUSTOM_NVIDIA_RTX6000_ADA_48Q"=1 \
    --property "trait:CUSTOM_NVIDIA_RTX6000_ADA_48Q"="required" \
    --property "pci_passthrough:alias"="rtx6000-ada-48q:8"
openstack flavor set --project admin 8xRTX-ADA-48Q
openstack flavor create 2xRTX-ADA-24Q --private \
    --ram 4096 --vcpus 4 --disk 0 \
    --property "resources:CUSTOM_NVIDIA_RTX6000_ADA_24Q"=1 \
    --property "trait:CUSTOM_NVIDIA_RTX6000_ADA_24Q"="required" \
    --property "pci_passthrough:alias"="rtx6000-ada-24q:2"
openstack flavor set --project admin 2xRTX-ADA-24Q
openstack flavor create 6xRTX-ADA-8Q --private \
    --ram 4096 --vcpus 4 --disk 0 \
    --property "resources:CUSTOM_NVIDIA_RTX6000_ADA_8Q"=1 \
    --property "trait:CUSTOM_NVIDIA_RTX6000_ADA_8Q"="required" \
    --property "pci_passthrough:alias"="rtx6000-ada-8q:6"
openstack flavor set --project admin 6xRTX-ADA-8Q
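To make the intent of the pci_passthrough:alias property concrete, the sketch below expands an alias request of the form name:count into the resource classes the reporter expects to receive. It illustrates the expectation only and does not reproduce Nova's scheduling logic; expected_pci_resources is a hypothetical helper.

```python
import json


def expected_pci_resources(alias_lines, alias_spec):
    """Map a flavor's 'name:count[,name:count]' alias request to
    {resource_class: count} using the [pci] alias definitions."""
    aliases = {}
    for line in alias_lines:
        # Everything after the first "=" is the JSON alias definition.
        alias = json.loads(line.partition("=")[2])
        aliases[alias["name"]] = alias["resource_class"]

    expected = {}
    for item in alias_spec.split(","):
        name, _, count = item.strip().partition(":")
        resource_class = aliases[name]
        expected[resource_class] = expected.get(resource_class, 0) + int(count)
    return expected


alias_lines = [
    'alias = { "resource_class":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", '
    '"traits":"CUSTOM_NVIDIA_RTX6000_ADA_24Q", "device_type":"type-VF", '
    '"name":"rtx6000-ada-24q" }',
]

print(expected_pci_resources(alias_lines, "rtx6000-ada-24q:2"))
```

For the 2xRTX-ADA-24Q flavor this yields two units of CUSTOM_NVIDIA_RTX6000_ADA_24Q, which is the allocation the report expects to see for the PCI devices.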
Each PF has VFs enabled, and on every boot the current_vgpu_type of
each VF is set to the profile shown in the device_spec entries above.
Steps to reproduce
Create one instance per flavor, starting with the 8x48Q flavor
(8xRTX-ADA-48Q).
Verify that the host inventory shows 2 free 24Q VFs and 6 free 8Q VFs.
Boot an instance with 2 24Q VFs (flavor 2xRTX-ADA-24Q).
The instance comes up with 8 GB of VRAM instead of 24 GB.
This is not unique to these specific vGPU profiles. In general, Nova
mismatches resources, typically substituting a lower-Q profile for a
higher one. In some cases it assigns the right resource. For example,
if I first create an instance that uses all the 8Q VFs (which it
consistently does correctly), OpenStack then also assigns the 2x 24Q
VFs correctly, seemingly because they are the last resource left to
assign.
In my case, every time I recreate the instance it spawns with the
incorrect VFs attached. I remove the instance with the 2xRTX-ADA-24Q
flavor (and the mismatched 2x 8Q resources) and perform a clean reboot
(the nvidia-vgpu-vfio driver complains that it fails to post the VM
shutdown event on all mounted VFs). After boot, I confirm that my
8x48Q instance is correct, and then the 2xRTX-ADA-24Q instance once
again spawns incorrectly with 2 8Q VFs.
Expected result
The scheduler should allocate only from resource providers offering
CUSTOM_NVIDIA_RTX6000_ADA_24Q and the matching trait; the guest should
always see a 24 GB framebuffer. If not enough VFs are available, the
request should fail rather than ever substitute a different VF.
Actual result
Placement allocation occasionally contains a provider with
CUSTOM_NVIDIA_RTX6000_ADA_8Q. The instance builds successfully; inside
the guest, nvidia-smi reports 2x 0 MiB / 8192 MiB VRAM.
Impact
Workloads requiring >8 GB fail or OOM. Operators must manually rebuild
affected VMs, defeating automated scheduling.
Evidence (example of a failed VM)
Conductor log:
2025-06-18 18:20:06.162 1102 INFO nova.compute.rpcapi [None
req-4554da9d-0c42-4dee-b35e-660b2a4ebd64 - - - - - -] Automatically
selected compute RPC version 6.4 from minimum service version 68
Compute log :
2025-06-18 18:20:07.365 7 INFO nova.compute.claims [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Claim successful on node LBRN-HV
2025-06-18 18:20:10.983 7 INFO nova.compute.pci_placement_translator [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] Placement PCI resource view: Placement PCI view on LBRN-HV: RP(LBRN-HV_0000:4F:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:52:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:53:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:56:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:57:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:CE:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:D1:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:D2:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:D5:00.0, CUSTOM_NVIDIA_RTX6000_ADA_8Q=6, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_8Q), RP(LBRN-HV_0000:D6:00.0, CUSTOM_NVIDIA_RTX6000_ADA_24Q=2, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_24Q)
2025-06-18 18:20:10.985 7 INFO nova.scheduler.client.report [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] Performing resource provider inventory and allocation data migration.
2025-06-18 18:20:14.084 7 INFO nova.virt.libvirt.driver [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Ignoring supplied device name: /dev/vda. Libvirt can't honour user-supplied dev names
2025-06-18 18:20:15.109 7 INFO nova.virt.block_device [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Booting with volume snapshot 43de2624-e662-4c04-9d34-b783c28765a9 at /dev/vda
2025-06-18 18:20:19.056 7 INFO os_brick.initiator.connectors.lightos [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] Current host hostNQN nqn.2014-08.org.nvmexpress:uuid:0229145d-80ab-5a47-9954-89ba44d6e654 and IP(s) are ['172.31.21.2', '172.31.21.250', 'fe80::826a:e924:f880:a9b6', '172.31.1.11', '172.31.1.250', 'fe80::9597:33d0:bb62:d15', '192.168.122.1', 'fe80::8ba:dc94:6214:cfc9', 'fe80::b532:519:7aca:62b1', 'fe80::b451:bdff:feee:f5e1', 'fe80::1015:f3ff:fe84:45ea', 'fe80::64b5:c8ff:fe76:798b', 'fe80::d051:ceff:fe48:fc12', 'fe80::fc16:3eff:fe0d:2e39']
2025-06-18 18:20:21.749 7 INFO nova.virt.libvirt.driver [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Creating image(s)
2025-06-18 18:20:21.813 7 INFO os_brick.initiator.connectors.iscsi [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] Trying to connect to iSCSI portal 172.31.21.2:3260
2025-06-18 18:20:23.643 7 INFO os_vif [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] Successfully plugged vif VIFBridge(active=False,address=fa:16:3e:a6:79:2c,bridge_name='qbr920efb74-78',has_traffic_filtering=True,id=920efb74-783a-4484-924e-2e6d7c560781,network=Network(7a3552f8-c90c-4fcd-a32f-9e3bee272b89),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap920efb74-78')
2025-06-18 18:20:28.858 7 INFO nova.compute.pci_placement_translator [None req-85589a64-2bfe-44b3-b93a-fd595e6b7e7d - - - - - -] Placement PCI resource view: Placement PCI view on LBRN-HV: RP(LBRN-HV_0000:4F:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:52:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:53:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:56:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:57:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:CE:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:D1:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:D2:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), RP(LBRN-HV_0000:D5:00.0, CUSTOM_NVIDIA_RTX6000_ADA_8Q=6, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_8Q), RP(LBRN-HV_0000:D6:00.0, CUSTOM_NVIDIA_RTX6000_ADA_24Q=2, traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_24Q)
2025-06-18 18:20:29.443 7 INFO nova.compute.manager [None req-776dc67f-a3f0-473a-a8b3-3bf4b0f859aa - - - - - -] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] VM Started (Lifecycle Event)
2025-06-18 18:20:29.457 7 INFO nova.virt.libvirt.driver [-] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Instance spawned successfully.
2025-06-18 18:20:29.458 7 INFO nova.compute.manager [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Took 7.71 seconds to spawn the instance on the hypervisor.
2025-06-18 18:20:29.995 7 INFO nova.compute.manager [None req-8d7975d5-780b-459b-ac4e-8ef5810d8bbe 25c2808d552741ce849e4fd9b320065b 073808578a7e4e8aa0ceebd1a69b34a6 - - default default] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] Took 22.78 seconds to build instance.
2025-06-18 18:20:30.466 7 INFO nova.compute.manager [None req-776dc67f-a3f0-473a-a8b3-3bf4b0f859aa - - - - - -] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] VM Paused (Lifecycle Event)
2025-06-18 18:20:30.980 7 INFO nova.compute.manager [None req-776dc67f-a3f0-473a-a8b3-3bf4b0f859aa - - - - - -] [instance: 67f35cb5-44b5-409c-b3df-8d30397ff232] VM Resumed (Lifecycle Event)
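The "Placement PCI resource view" lines are long, so extracting the per-provider inventory makes them easier to compare with the configuration. This is a rough sketch that assumes the log format seen in the two lines above (one resource class per provider); parse_pci_view is a hypothetical helper, and the sample line is shortened to three providers.

```python
import re

# Matches entries such as:
#   RP(LBRN-HV_0000:D5:00.0, CUSTOM_..._8Q=6, traits=A,B)
RP_PATTERN = re.compile(
    r"RP\((?P<name>[^,]+), (?P<rc>\w+)=(?P<total>\d+), "
    r"traits=(?P<traits>[^)]*)\)"
)


def parse_pci_view(log_line):
    """Return one dict per resource provider found in the log line."""
    providers = []
    for match in RP_PATTERN.finditer(log_line):
        providers.append({
            "name": match["name"],
            "resource_class": match["rc"],
            "total": int(match["total"]),
            "traits": match["traits"].split(","),
        })
    return providers


sample_line = (
    "Placement PCI view on LBRN-HV: "
    "RP(LBRN-HV_0000:D2:00.0, CUSTOM_NVIDIA_RTX6000_ADA_48Q=1, "
    "traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_48Q), "
    "RP(LBRN-HV_0000:D5:00.0, CUSTOM_NVIDIA_RTX6000_ADA_8Q=6, "
    "traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_8Q), "
    "RP(LBRN-HV_0000:D6:00.0, CUSTOM_NVIDIA_RTX6000_ADA_24Q=2, "
    "traits=COMPUTE_MANAGED_PCI_DEVICE,CUSTOM_NVIDIA_RTX6000_ADA_24Q)"
)

for provider in parse_pci_view(sample_line):
    print(provider["name"], provider["resource_class"], provider["total"])
```

In both log lines above, the view lists one provider per PF: eight providers with one 48Q unit each, one with six 8Q units, and one with two 24Q units.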
Resource provider troubleshooting:
openstack server list
+--------------------------------------+------------------+-------------------+---------------------+--------------------------+---------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------+-------------------+---------------------+--------------------------+---------------+
| 1ea4855c-06c2-44d8-99e2-a2b4bc463e1b | 2xRTX-ADA-24Q | ACTIVE | inter=192.168.0.117 | N/A (booted from volume) | 2xRTX-ADA-24Q |
| fe09bf4a-cd87-4916-9b8d-18ceaed50c92 | 8xRTX-ADA-48Q | ACTIVE | inter=192.168.0.159 | N/A (booted from volume) | 8xRTX-ADA-48Q |
| c96e76a9-b122-45e3-b7fa-111be3d90922 | Win11-vm | SHUTOFF | inter=192.168.0.142 | N/A (booted from volume) | m1.medium |
+--------------------------------------+------------------+-------------------+---------------------+--------------------------+---------------+
openstack resource provider allocation show 1ea4855c-06c2-44d8-99e2-a2b4bc463e1b
+--------------------------------------+------------+--------------------------------------+----------------------------------+----------------------------------+
| resource_provider | generation | resources | project_id | user_id |
+--------------------------------------+------------+--------------------------------------+----------------------------------+----------------------------------+
| 8b21748f-d43e-48b7-b3ca-46d565c819ce | 213 | {'VCPU': 4, 'MEMORY_MB': 4096} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
| 19a368c6-52fe-4bf6-a6e6-19bd45f1538c | 35 | {'CUSTOM_NVIDIA_RTX6000_ADA_24Q': 1} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
| b73d2e49-bf73-40ab-b022-5e3994993090 | 114 | {'CUSTOM_NVIDIA_RTX6000_ADA_8Q': 2} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
+--------------------------------------+------------+--------------------------------------+----------------------------------+----------------------------------+
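The mismatch in the allocation table above can also be detected mechanically. The sketch below parses the table text printed by the CLI and sums the custom resource classes; the helper names are hypothetical, and the parser assumes the column layout shown above (provider UUID first, resources third).

```python
import ast


def parse_allocation_table(table_text):
    """Return (provider_uuid, resources_dict) pairs from the CLI table."""
    rows = []
    for raw in table_text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0] == "resource_provider":
            continue  # skip the header row
        rows.append((cells[0], ast.literal_eval(cells[2])))
    return rows


def custom_totals(rows):
    """Sum allocated amounts per CUSTOM_* resource class."""
    totals = {}
    for _provider, resources in rows:
        for resource_class, amount in resources.items():
            if resource_class.startswith("CUSTOM_"):
                totals[resource_class] = totals.get(resource_class, 0) + amount
    return totals


table = """
| resource_provider | generation | resources | project_id | user_id |
| 8b21748f-d43e-48b7-b3ca-46d565c819ce | 213 | {'VCPU': 4, 'MEMORY_MB': 4096} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
| 19a368c6-52fe-4bf6-a6e6-19bd45f1538c | 35 | {'CUSTOM_NVIDIA_RTX6000_ADA_24Q': 1} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
| b73d2e49-bf73-40ab-b022-5e3994993090 | 114 | {'CUSTOM_NVIDIA_RTX6000_ADA_8Q': 2} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
"""

wanted = "CUSTOM_NVIDIA_RTX6000_ADA_24Q"
totals = custom_totals(parse_allocation_table(table))
unexpected = {rc: n for rc, n in totals.items() if rc != wanted}
print("allocated:", totals)
print("unexpected:", unexpected)
```

For this instance the check flags two units of CUSTOM_NVIDIA_RTX6000_ADA_8Q alongside a single unit of the requested 24Q class.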
cat /sys/bus/pci/devices/0000\:d6\:00.4/nvidia/current_vgpu_type
949
cat /sys/bus/pci/devices/0000\:d6\:00.5/nvidia/current_vgpu_type
949
nvidia-smi vgpu
Wed Jun 18 18:14:55 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.06 Driver Version: 570.148.06 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA RTX 6000 Ada Ge... | 00000000:4F:00.0 | 0% |
| 3251634352 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 1 NVIDIA RTX 6000 Ada Ge... | 00000000:52:00.0 | 0% |
| 3251634357 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 2 NVIDIA RTX 6000 Ada Ge... | 00000000:53:00.0 | 0% |
| 3251634387 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 3 NVIDIA RTX 6000 Ada Ge... | 00000000:56:00.0 | 0% |
| 3251634362 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 4 NVIDIA RTX 6000 Ada Ge... | 00000000:57:00.0 | 0% |
| 3251634367 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 5 NVIDIA RTX 6000 Ada Ge... | 00000000:CE:00.0 | 0% |
| 3251634372 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 6 NVIDIA RTX 6000 Ada Ge... | 00000000:D1:00.0 | 0% |
| 3251634377 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 7 NVIDIA RTX 6000 Ada Ge... | 00000000:D2:00.0 | 0% |
| 3251634382 NVIDIA RTX6... | fe09... instance-0000003c | 0% |
+---------------------------------+------------------------------+------------+
| 8 NVIDIA RTX 6000 Ada Ge... | 00000000:D5:00.0 | 0% |
| 3251634392 NVIDIA RTX6... | 1ea4... instance-00000043 | 0% |
| 3251634398 NVIDIA RTX6... | 1ea4... instance-00000043 | 0% |
+---------------------------------+------------------------------+------------+
| 9 NVIDIA RTX 6000 Ada Ge... | 00000000:D6:00.0 | 0% |
+---------------------------------+------------------------------+------------+
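To see which physical GPU actually backs each instance, the nvidia-smi vgpu table can be reduced to a mapping from GPU bus ID to the VM names using it. This is a sketch under the assumption that the output keeps the three-column layout shown above; vgpus_per_gpu is a hypothetical helper, and the sample contains only the last two GPUs.

```python
import re

# PCI bus IDs as printed by nvidia-smi, e.g. 00000000:D5:00.0
BUS_ID = re.compile(r"^[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7]$")


def vgpus_per_gpu(smi_text):
    """Return {gpu_bus_id: [vm_name, ...]} from 'nvidia-smi vgpu' output."""
    usage = {}
    current = None
    for raw in smi_text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # borders, date line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue  # banner and separator rows
        if BUS_ID.match(cells[1]):
            # A physical GPU row starts a new group.
            current = cells[1]
            usage[current] = []
        elif current is not None and cells[0][:1].isdigit():
            # A vGPU row: the second column ends with the VM name.
            usage[current].append(cells[1].split()[-1])
    return usage


sample = """
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
| 8 NVIDIA RTX 6000 Ada Ge... | 00000000:D5:00.0 | 0% |
| 3251634392 NVIDIA RTX6... | 1ea4... instance-00000043 | 0% |
| 3251634398 NVIDIA RTX6... | 1ea4... instance-00000043 | 0% |
+---------------------------------+------------------------------+------------+
| 9 NVIDIA RTX 6000 Ada Ge... | 00000000:D6:00.0 | 0% |
"""

for bus_id, vms in vgpus_per_gpu(sample).items():
    print(bus_id, vms)
```

In the output above, both vGPUs of instance-00000043 sit on the GPU at D5:00.0, while the GPU at D6:00.0, which hosts the 24Q VFs, has none.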
Notes:
- The biggest red flag is that OpenStack believes the instance has 1x 24Q and 2x 8Q. That is false: the instance only sees 2x 8Q.
- The nvidia-smi vgpu command reports the current state correctly: only 2x 8Q VFs are in use.
- mdev, the traditional supported route, is not enabled here.
I want to emphasize that if I change the order of my instance
spawning, I can get the correct configuration. If I start with 8x48Q
profiles, then 6x8Q profiles, followed by 2x24Q profiles, it spawns
perfectly:
openstack server list
+--------------------------------------+------------------+-------------------+---------------------+--------------------------+---------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------+-------------------+---------------------+--------------------------+---------------+
| c439a08f-6b5e-42f7-b6ff-dfcb03a64176 | 2xRTX-ADA-24Q | ACTIVE | inter=192.168.0.82 | N/A (booted from volume) | 2xRTX-ADA-24Q |
| ae11ce9e-fa05-417b-be3a-15404b8da9e3 | 6xRTX-ADA-8Q | ACTIVE | inter=192.168.0.107 | N/A (booted from volume) | 6xRTX-ADA-8Q |
| fe09bf4a-cd87-4916-9b8d-18ceaed50c92 | 8xRTX-ADA-48Q | ACTIVE | inter=192.168.0.159 | N/A (booted from volume) | 8xRTX-ADA-48Q |
| c96e76a9-b122-45e3-b7fa-111be3d90922 | Win11-vm | SHUTOFF | inter=192.168.0.142 | N/A (booted from volume) | m1.medium |
| b7417e4f-a647-493b-9cb6-6f76a73e7a9a | Ubuntu 24.04 LTS | SHELVED_OFFLOADED | dmz= | N/A (booted from volume) | m1.tiny |
+--------------------------------------+------------------+-------------------+---------------------+--------------------------+---------------+
openstack resource provider allocation show c439a08f-6b5e-42f7-b6ff-dfcb03a64176
+--------------------------------------+------------+--------------------------------------+----------------------------------+----------------------------------+
| resource_provider | generation | resources | project_id | user_id |
+--------------------------------------+------------+--------------------------------------+----------------------------------+----------------------------------+
| 8b21748f-d43e-48b7-b3ca-46d565c819ce | 225 | {'VCPU': 4, 'MEMORY_MB': 4096} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
| 19a368c6-52fe-4bf6-a6e6-19bd45f1538c | 45 | {'CUSTOM_NVIDIA_RTX6000_ADA_24Q': 2} | 073808578a7e4e8aa0ceebd1a69b34a6 | 25c2808d552741ce849e4fd9b320065b |
+--------------------------------------+------------+--------------------------------------+----------------------------------+----------------------------------+
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2114947/+subscriptions