
yahoo-eng-team team mailing list archive

[Bug 2127472] Re: Nova compute agent reports dev_type as type-PF for GPUs despite type-PCI configuration

 

This bug report is invalid.

There is a misunderstanding of what device_type in the alias is for.

It is not a way to configure the device type; it is explicitly a way to
match on a device based on its capabilities, similar to vendor ID or
product ID.

It exists because there are devices, such as Intel QAT devices or NVIDIA
GPUs, that can have different PCIe capabilities depending on the UEFI
configuration, firmware, or other factors.

When you have a mix of such devices on a physical server with different
configurations, the device_type in the alias allows you to filter the
available devices.
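
As a rough illustration of that filtering role, here is a minimal Python sketch. It is not Nova's implementation: the device dicts, the PCI addresses, and the helper names alias_to_spec and match_alias are all hypothetical, and the only assumption carried over from the bug report is that the alias key device_type is compared against the dev_type the host reports.

```python
# Illustrative sketch only; not Nova's code. Addresses are made up.
# Each dict stands in for a PCI device as detected on the host.
devices = [
    {"address": "0000:17:00.0", "vendor_id": "10de",
     "product_id": "27b8", "dev_type": "type-PF"},
    {"address": "0000:65:00.0", "vendor_id": "10de",
     "product_id": "27b8", "dev_type": "type-PCI"},
]


def alias_to_spec(alias):
    """Translate the alias key device_type to the reported key dev_type."""
    spec = dict(alias)
    if "device_type" in spec:
        spec["dev_type"] = spec.pop("device_type")
    return spec


def match_alias(alias, devices):
    """Return the devices whose fields equal every field in the alias."""
    spec = alias_to_spec(alias)
    return [d for d in devices
            if all(d.get(key) == value for key, value in spec.items())]


alias = {"vendor_id": "10de", "product_id": "27b8",
         "device_type": "type-PCI"}
matched = match_alias(alias, devices)
```

In this sketch the alias selects only the device already detected as type-PCI; it never changes what the other device is reported as.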

The expected behavior is invalid and not how this is intended to work in
Nova, so the proposed solution is also invalid, as it implements a new
feature that will break existing behavior.

** Changed in: nova
       Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2127472

Title:
  Nova compute agent reports dev_type as type-PF for GPUs despite
  type-PCI configuration

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Problem Description:

  The Nova compute agent incorrectly identifies a GPU passthrough device
  as dev_type='type-PF' in the resource tracker's final view, even when
  the configuration explicitly sets it to type-PCI using [pci]alias.

  This occurs because the libvirt driver's hardware detection logic
  identifies the device as a Physical Function (PF) if it is SR-IOV-
  capable. This hardware detection takes precedence over the operator's
  configuration in nova.conf. The presence of the <capability
  type='virt_functions'> element in the device's XML description from
  libvirt is what triggers this behavior.
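
The detection described above can be sketched roughly as follows. This is a simplified stand-in, not the code in nova/virt/libvirt/host.py: the function name classify and the sample XML (including the device name and maxCount value) are hypothetical, and the phys_function branch is an assumption based on libvirt's node device XML format rather than something stated in this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical node device XML, trimmed to the parts relevant here.
NODEDEV_XML = """
<device>
  <name>pci_0000_17_00_0</name>
  <capability type='pci'>
    <vendor id='0x10de'/>
    <product id='0x27b8'/>
    <capability type='virt_functions' maxCount='16'/>
  </capability>
</device>
"""


def classify(xml_text):
    """Guess a device type from the nested PCI capabilities."""
    root = ET.fromstring(xml_text)
    pci = root.find("capability[@type='pci']")
    if pci is None:
        return None  # not a PCI device
    nested = {cap.get("type") for cap in pci.findall("capability")}
    if "virt_functions" in nested:
        return "type-PF"   # device advertises SR-IOV virtual functions
    if "phys_function" in nested:
        return "type-VF"   # assumption: a VF points back at its parent PF
    return "type-PCI"


detected = classify(NODEDEV_XML)
```

With the virt_functions element present, this sketch yields type-PF regardless of any alias in nova.conf, which matches the behavior the report observes.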

  The incorrect dev_type prevents the Nova scheduler from correctly
  matching the device for instances that require a type-PCI device. This
  is particularly problematic for use cases like full GPU passthrough
  with non-GRID NVIDIA drivers, which cannot be loaded in the guest if
  the device is presented in SR-IOV mode.

  Expected Behavior:

  The Nova compute agent should honor the device_type specified in the
  [pci]alias configuration, using it to override the hardware-detected
  device type. If a device is configured as type-PCI, the resource
  tracker should report it as such, regardless of its underlying SR-IOV
  capabilities. At a minimum, the documentation should clarify how this
  configuration value interacts with the autodetection logic, and when
  the two must match or may differ.

  Steps to Reproduce:
  * Configure a compute node with a GPU that supports SR-IOV (e.g., an NVIDIA L4 or similar).
  * In nova.conf on the compute node, configure PCI passthrough for this device using device_spec. Explicitly set the device_type to type-PCI.
  * Restart the nova-compute service.
  * Observe the nova-compute.log. The logs will show that the configuration is loaded correctly, but the final resource view reported by the resource tracker will show dev_type='type-PF'.

  Configuration:

  [pci]
  alias = {"vendor_id":"10de", "product_id":"27b8", "device_type":"type-PCI"}

  Logs:
  The nova-compute.log will show the device being reported incorrectly in the final resource view:

  DEBUG nova.compute.resource_tracker [...] Final resource view: ...
  pci_stats=[PciDevicePool(...,tags={...,dev_type='type-PF',...})]

  Proposed Fix:

  A patch has been developed that modifies the _get_device_type() function in nova/virt/libvirt/host.py. The fix changes the device discovery logic to prioritize the configuration from nova.conf.
  Before inspecting the device's hardware capabilities, the function now checks if a device_spec rule in the configuration matches the PCI device being probed.
  If a matching rule with an explicit device_type is found, that device_type is used, and the hardware detection logic is bypassed.
  If no specific device_type is configured for the device, the code falls back to the existing hardware detection mechanism.
  This change ensures that the operator's configuration is the source of truth for the device type, allowing the scheduler to function correctly. The fix also includes unit tests to validate the new behavior and prevent regressions.

  NOTE: the device_type configured in the alias is not applied to
  type-VF devices, as we have to preserve the integrity of the
  parent-child relationship between a Virtual Function (VF) and its
  Physical Function (PF). Overriding a VF's type would break this
  critical link and corrupt Nova's understanding of the host's hardware
  topology.
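
The precedence described in this proposal could be sketched as below. This only illustrates the reporter's proposed rule, which the reply at the top of this message rejects as invalid; it is not the actual patch, and the function name resolve_dev_type and its parameters are hypothetical.

```python
def resolve_dev_type(detected_type, configured_type=None):
    """Apply the proposed rule: configuration wins, except for VFs."""
    if detected_type == "type-VF":
        # Never override a VF, so its link to the parent PF stays intact.
        return detected_type
    if configured_type:
        # An explicitly configured type bypasses hardware detection.
        return configured_type
    # No configured type: fall back to what the hardware probe found.
    return detected_type


overridden = resolve_dev_type("type-PF", "type-PCI")
```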

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2127472/+subscriptions


