[Bug 1981631] [NEW] Nova fails to reuse mdev vgpu devices

 

Public bug reported:

Description:
============================
Hello, we are experiencing an odd issue: Nova creates the mdev devices on the virtual functions when none exist yet, but it will not reuse them once they have all been created and the vGPU instances have been deleted.


I believe part of this issue was the mdev UUID issue described in this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1701281

Manually applying the latest patch partially fixed the issue (Placement
stopped reporting no hosts available); the error now comes from the
hypervisor side saying 'no vgpu resources available'.
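
For reference, this is roughly how we check the Placement side when that error shows up (assumes the osc-placement client plugin is installed; the provider UUID is a placeholder):

# confirm the VGPU inventory and current usage on the compute node's provider
openstack resource provider list
openstack resource provider inventory list <provider-uuid>
openstack resource provider usage show <provider-uuid>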

If I manually remove an mdev device with a command like the following:
echo "1" > /sys/bus/mdev/devices/150c155c-da0b-45a6-8bc1-a8016231b100/remove

then I'm able to spin up an instance again.
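
In practice we clear all of the stale mdevs this way. A minimal sketch of that manual cleanup, assuming every device listed under /sys/bus/mdev/devices is one of the stale mdevs Nova created and none of them are still attached to a running guest:

# remove every existing mdev via sysfs (destructive: only safe if no guest is using them)
for dev in /sys/bus/mdev/devices/*; do
    echo "1" > "$dev/remove"
done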

All mdev devices match between mdevctl list and virsh nodedev-list.

Steps to reproduce:
================================
1) Freshly set up hypervisor with no mdev devices created yet
2) Spin up vGPU instances until all the mdevs that fit on the physical GPU(s) have been created
3) Delete the vGPU instances
4) Try to spin up new vGPU instances (example commands below)
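
The commands we use for steps 2-4 look roughly like this (flavor, image and network names are placeholders; the flavor is assumed to carry resources:VGPU=1 in its extra specs):

openstack flavor create --vcpus 4 --ram 8192 --disk 40 \
    --property resources:VGPU=1 vgpu-a40-12q
openstack server create --flavor vgpu-a40-12q --image centos8-stream \
    --network private colby_gpu_test
openstack server delete colby_gpu_test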

Expected Result:
=====================================
Instances spin up and reuse the existing mdev vGPU devices

Actual Result:
=====================================
Build error from Nova API:
Error: Failed to perform requested operation on instance "colby_gpu_test23", the instance has an error status: Please try again later [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance c18565f9-da37-42e9-97b9-fa33da5f1ad0.].

Error in hypervisor logs:
nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: vGPU resource is not available
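
A quick way to cross-check what the kernel itself reports for a given type (sysfs layout assumed from the standard mdev interface; the PCI address and type name are taken from the mdevctl output below):

cat /sys/class/mdev_bus/0000:21:00.7/mdev_supported_types/nvidia-563/available_instances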

mdevctl list output:
cdc98056-8597-4531-9e55-90ab44a71b4e 0000:21:00.7 nvidia-563 manual
298f1e4b-784d-42a9-b3e5-bdedd0eeb8e1 0000:21:01.2 nvidia-563 manual
2abee89e-8cb4-4727-ac2f-62888daab7b4 0000:21:02.4 nvidia-563 manual
32445186-57ca-43f4-b599-65a455fffe65 0000:21:04.2 nvidia-563 manual
0c4f5d07-2893-49a1-990e-4c74c827083b 0000:81:00.7 nvidia-563 manual
75d1b78c-b097-42a9-b736-4a8518b02a3d 0000:81:01.2 nvidia-563 manual
a54d33e0-9ddc-49bb-8908-b587c72616a9 0000:81:02.5 nvidia-563 manual
cd7a49a8-9306-41bb-b44e-00374b1e623a 0000:81:03.4 nvidia-563 manual

virsh nodedev-list --cap mdev:
mdev_0c4f5d07_2893_49a1_990e_4c74c827083b_0000_81_00_7
mdev_298f1e4b_784d_42a9_b3e5_bdedd0eeb8e1_0000_21_01_2
mdev_2abee89e_8cb4_4727_ac2f_62888daab7b4_0000_21_02_4
mdev_32445186_57ca_43f4_b599_65a455fffe65_0000_21_04_2
mdev_75d1b78c_b097_42a9_b736_4a8518b02a3d_0000_81_01_2
mdev_a54d33e0_9ddc_49bb_8908_b587c72616a9_0000_81_02_5
mdev_cd7a49a8_9306_41bb_b44e_00374b1e623a_0000_81_03_4
mdev_cdc98056_8597_4531_9e55_90ab44a71b4e_0000_21_00_7

nvidia-smi vgpu output:
Wed Jul 13 20:15:16 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06              Driver Version: 510.73.06                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA A40                 | 00000000:21:00.0             |   0%       |
|      3251635106  NVIDIA A40-12Q | 2786...  instance-00014520   |      0%    |
|      3251635117  NVIDIA A40-12Q | 6dc4...  instance-0001452f   |      0%    |
+---------------------------------+------------------------------+------------+
|   1  NVIDIA A40                 | 00000000:81:00.0             |   0%       |
|      3251635061  NVIDIA A40-12Q | 0d95...  instance-000144de   |      0%    |
|      3251635094  NVIDIA A40-12Q | 40a0...  instance-0001450e   |      0%    |
|      3251635112  NVIDIA A40-12Q | 776e...  instance-00014529   |      0%    |
+---------------------------------+------------------------------+------------+


Environment:
===========================================
CentOS 8 Stream
OpenStack Victoria (Nova 22.4.0-1)
libvirt 8.0.0-6
qemu-kvm 6.2.0-12
NVIDIA A40 GPUs

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "Full logs during failed vgpu instance creation"
   https://bugs.launchpad.net/bugs/1981631/+attachment/5602938/+files/nova_vgpu_fail_log.txt
