[Bug 2117481] [NEW] disk and interfaces not handling the pcie device limit for q35

Public bug reported:

The num_pcie_ports libvirt option defines the total number of PCIe root
ports, and therefore hotpluggable PCIe slots, available to an instance
using the q35 machine type.

https://docs.openstack.org/nova/latest/configuration/config.html#:~:text=ignored%20by%20nova.-,num_pcie_ports,-%C2%B6

Since both volume attachments and virtual NICs (Neutron ports) consume
PCIe slots, or more precisely pcie-root-port controllers, the
"max_disk_devices_to_attach" configuration option is suboptimal: it
doesn't account for the NICs/ports already attached to the VM.

https://docs.openstack.org/nova/latest/configuration/config.html#:~:text=means%20no%20limit.-,max_disk_devices_to_attach,-%C2%B6
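
For illustration, on a q35 guest every hotpluggable device occupies one
pcie-root-port controller in the libvirt domain XML. The excerpt below
is a hand-written sketch of the relevant elements, not taken from a
real instance:

    <controller type='pci' index='1' model='pcie-root-port'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <!-- this NIC is plugged into the root port with index 1 (bus 0x01) -->
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>

Every NIC or volume hotplugged into the instance claims one such free
root port; once they are gone, no further PCIe device can be attached.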

This can lead to a resource allocation issue and to a configured limit
that can never take effect as intended. For example, consider the
following configuration:

[libvirt]
num_pcie_ports = 19

[compute]
max_disk_devices_to_attach = 15

A user could create a VM with 5 ports and then attach 14 volumes,
consuming all 19 available PCIe slots. If they then try to attach
another volume, libvirt denies the request and raises a
"No more available PCI slots" error.

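A rough reproduction with the openstack CLI (flavor, image, network,
volume, and server names are placeholders, and the image is assumed to
carry the hw_machine_type=q35 property):

    # boot a server with 5 NICs, i.e. 5 pcie-root-ports used by ports
    openstack server create --flavor m1.small --image q35-image \
      --network net1 --network net2 --network net3 \
      --network net4 --network net5 vm-pcie-test

    # attach volumes until the root ports run out
    for i in $(seq 1 15); do
      openstack volume create --size 1 vol-$i
      openstack server add volume vm-pcie-test vol-$i
    done

Depending on what else occupies a port (e.g. the root disk), the slots
are exhausted around the 13th or 14th volume, after which libvirt
rejects each further attachment with "No more available PCI slots"
while the API call itself reports no error.
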
Crucially, OpenStack doesn't inform the user via an HTTP 500 or 403
that the volume attachment is failing due to a lack of available PCIe
slots, which causes confusion. In this scenario, the
"max_disk_devices_to_attach" limit can't even be reached if the VM is
configured with more than 4 ports, as the instance runs out of PCIe
slots first: with 5 ports, at most 19 - 5 = 14 volumes fit, one short
of the configured 15.

This silent failure only applies to volume attachments. Attempting to
add another port, for example, returns a 500 "Failed to attach network
adapter device" error. However, this message also obscures the root
cause of the failure, as it doesn't expose the underlying libvirt
exception.

We created a patch that checks for available PCIe ports during both
volume and network interface attachments. This check respects the
max_disk_devices_to_attach configuration option.
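
The core idea of the check, as a minimal sketch rather than the actual
patch (the XML paths and the assumption that a root port's index equals
the bus number of devices plugged into it are simplifications):

    import xml.etree.ElementTree as ET

    def count_free_pcie_root_ports(domain_xml: str) -> int:
        """Count pcie-root-port controllers with no device attached."""
        root = ET.fromstring(domain_xml)
        # Buses exposed by pcie-root-port controllers: a root port's
        # "index" is the bus number a hotplugged device ends up on.
        port_buses = set()
        for ctrl in root.findall("./devices/controller[@type='pci']"):
            if ctrl.get("model") == "pcie-root-port":
                port_buses.add(int(ctrl.get("index")))
        # Buses already occupied by PCI addresses of existing devices.
        used_buses = set()
        for addr in root.iter("address"):
            if addr.get("type") == "pci" and addr.get("bus") is not None:
                used_buses.add(int(addr.get("bus"), 16))
        return len(port_buses - used_buses)

Before attaching a volume or an interface, the driver can feed this the
current dom.XMLDesc() and reject the request early with an explicit
"out of PCIe slots" error instead of surfacing the opaque libvirt
failure.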

Ideally, the num_pcie_ports configuration should define the actual
limit for attachable PCIe devices. However, in our QEMU + libvirt
environment, this setting is unreliable. For example, when
num_pcie_ports is set to its maximum of 28, the instance only has 25
available PCIe ports; three ports are always missing, for reasons we
haven't identified (one possible explanation is that devices present in
the initial domain XML already occupy some of the root ports, but we
have not confirmed this).

This discrepancy causes the instance to run out of PCIe slots before the
attachment limit is ever reached, reintroducing the original problem.
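
One way to see what libvirt actually created is to count the root ports
in the live domain (a sketch using libvirt-python; the domain name is a
placeholder):

    import xml.etree.ElementTree as ET
    import libvirt  # libvirt-python

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")  # placeholder name
    root = ET.fromstring(dom.XMLDesc(0))
    ports = [c for c in root.findall("./devices/controller[@type='pci']")
             if c.get("model") == "pcie-root-port"]
    print(len(ports))  # compare against the configured num_pcie_ports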

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: cinder config libvirt neutron volumes
