
yahoo-eng-team team mailing list archive

[Bug 1906494] [NEW] Placement error when using GPUs that utilize SR-IOV for VGPU

 

Public bug reported:

I'm experiencing some weird bugs/brokenness when trying to use the
Nvidia A100 for vGPU in nova. It's a little bit involved, but I think I
have an idea of what's going on.

My setup:
Dell R7525 with 2xNvidia A100
CentOS 8.2
Openstack Ussuri (openstack-nova-compute.noarch 1:21.1.1-1.el8)
Nvidia GRID 11.2

A little preface
================
The Nvidia A100 uses SR-IOV to create mdev devices for vGPUs. The Nvidia driver ships a tool to enable SR-IOV on the physical cards; this creates 16 VFs per card, which all appear in /sys/class/mdev_bus. Each VF can create only one mdev (no matter which GRID profile you are using). When nova-compute starts, it automatically populates placement with (in my case) all 32 VFs, each with an inventory of 1 VGPU.
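To make the topology concrete, here's a small Python sketch of the resource-provider layout as I understand it (this is illustrative, not nova's actual code; the provider names are made up):

```python
# Illustrative model of the vGPU topology described above; NOT nova's
# code, just a sketch of the resource-provider layout nova-compute creates.

CARDS = 2            # two physical A100s in the host
VFS_PER_CARD = 16    # the SR-IOV tool creates 16 VFs per card

# One resource provider per VF, each with an inventory of exactly
# 1 VGPU (total = min_unit = max_unit = 1).
providers = {
    f"card{card}_vf{vf}": {"VGPU": {"total": 1, "min_unit": 1, "max_unit": 1}}
    for card in range(CARDS)
    for vf in range(VFS_PER_CARD)
}

print(len(providers))  # 32 providers, one per VF
```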

In my case, I create VMs with the GRID A100-20C profile (so, two VGPUs
per GPU). I've configured nova.conf with the correct
"devices/enabled_vgpu_types" (nvidia-472), and four of the 32 VFs as
valid in "vgpu_nvidia-472/device_addresses". In addition, I've set a
custom trait on the resource providers that corresponds to the devices
listed in nova.conf, and of course listed this trait in my flavor spec.
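For reference, the relevant nova.conf section looks roughly like this (the PCI addresses below are placeholders for the four VFs I allowed, not my real ones):

```ini
[devices]
enabled_vgpu_types = nvidia-472

[vgpu_nvidia-472]
# Four of the 32 VFs; the addresses here are placeholders.
device_addresses = 0000:41:00.4,0000:41:00.5,0000:c1:00.4,0000:c1:00.5
```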

The problem
===========
When a physical card is full of vGPUs, nova-compute seemingly tries to tell placement that all of the remaining VFs from that card now have 0 VGPUs in their inventory. This fails JSON validation, since all of the resource providers were created with 1 as the minimum. Here's the Python stacktrace from nova-compute.log: https://paste.ubuntu.com/p/TwfTkyMvqR/
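My reading of the failure, sketched in Python (a guess at the mechanism, not placement's actual schema code): the update that tries to set a VF's inventory total to 0 is rejected because the providers were created with a minimum of 1.

```python
# Hypothetical simplification of the check placement appears to apply
# when nova-compute PUTs an updated inventory for a VF provider.

def validate_vgpu_inventory(update):
    """Reject an inventory update whose total falls below the minimum."""
    minimum = 1  # the resource providers were created with 1 as the minimum
    if update["total"] < minimum:
        raise ValueError(
            f"total {update['total']} is less than the allowed minimum {minimum}"
        )
    return update

# A VF on a full card: nova-compute seemingly sends total = 0 ...
try:
    validate_vgpu_inventory({"total": 0})
except ValueError as exc:
    print(f"rejected: {exc}")  # ... and placement rejects the update
```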

The UUID from the stacktrace is _not_ one of the resource providers
that was successfully used to create a VGPU for the two VMs already
created. It's the UUID of the last VF on the same physical card, as
listed by 'openstack resource provider list'.

This error also stops me from:
* Creating more VMs with a VGPU (even though there are still physical cards with free capacity)
* Deleting the existing VMs (they end up in an error state, but can then be deleted if I reboot the host)

The stacktrace persists through nova-compute restarts, but it
disappears on a host reboot when the existing VMs are in the error
state after the deletion attempt.

Expected result
===============
Well, I expect nova to handle this properly. I have no idea how or why this happens, but I'm confident that you clever people will find a fix!

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: vgpu

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1906494

Title:
  Placement error when using GPUs that utilize SR-IOV for VGPU

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1906494/+subscriptions

