yahoo-eng-team team mailing list archive

[Bug 2125782] [NEW] PCI slots exhaustion is opaque to the API user

 

Public bug reported:

Recently, with the proliferation of Kubernetes and AI-specific workloads,
we have started seeing more frequent cases where users or automation
attempt to hot-plug more PCI(e) devices (ports, volumes) than the
qemu/libvirt virtual machine of a given machine type can handle.

For example, the q35 machine type family has a variable number of free
PCIe slots pre-created on boot by Nova (configurable, and currently
limited to 28). For i440fx machines, the total number of PCI devices is
limited to 32.

The problem is that the error the user receives in this case is a
standard HTTP 500 error with no information about the cause, so the user
may retry the request even though it definitely won't succeed unless
some other device is unplugged from the instance, or nova is
reconfigured and the instance is hard-rebooted to acquire more PCIe
slots.
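
To illustrate why the 500 is misleading for automation, here is a
minimal sketch of a typical client-side retry policy. The names
RETRYABLE and should_retry are hypothetical and not part of Nova or any
OpenStack client; the point is only that generic clients usually treat
5xx as transient and 4xx as final.

```python
from http import HTTPStatus

# Hypothetical retry policy of an automation client (not Nova code).
# Server-side errors are commonly assumed to be transient.
RETRYABLE = {
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.BAD_GATEWAY,
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.GATEWAY_TIMEOUT,
}


def should_retry(status: int) -> bool:
    """Return True if the client would repeat the request for this status."""
    return status in RETRYABLE


# Today's 500 looks transient, so the attach request is repeated in vain;
# a 409 would tell the client to stop and change something first.
print(should_retry(500), should_retry(409))
```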

IMO in this particular case it would be beneficial to share the error
with the user, for example by returning HTTP 409 Conflict (or at least
400 Bad Request) with a clear message indicating what the problem was,
so that the user (or some automation) can make an educated decision on
how to proceed.
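
A rough sketch of the proposed behaviour, using only the Python standard
library. PciSlotsExhausted and to_api_response are hypothetical names
made up for illustration, not existing Nova classes or functions, and
the response body layout is likewise an assumption.

```python
import json
from http import HTTPStatus


class PciSlotsExhausted(Exception):
    """Hypothetical exception; not an existing Nova class."""

    def __init__(self, instance_uuid: str, machine_type: str):
        self.instance_uuid = instance_uuid
        self.machine_type = machine_type
        super().__init__(
            f"Instance {instance_uuid} ({machine_type}) has no free PCI(e) "
            "slots. Detach another device, or have nova reconfigured and "
            "the instance hard-rebooted to acquire more slots."
        )


def to_api_response(exc: Exception) -> tuple[int, str]:
    """Translate an exception into an (HTTP status, JSON body) pair."""
    if isinstance(exc, PciSlotsExhausted):
        # Known, user-actionable condition: report it as a conflict.
        status = HTTPStatus.CONFLICT
        message = str(exc)
    else:
        # Anything else stays an opaque server error.
        status = HTTPStatus.INTERNAL_SERVER_ERROR
        message = "Unexpected API error."
    body = {"error": {"code": int(status), "title": status.phrase,
                      "message": message}}
    return int(status), json.dumps(body)
```

With something like this, a failed hot-plug would carry both a
non-retryable status code and a message naming the actual cause.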

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2125782

Title:
  PCI slots exhaustion is opaque to the API user

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2125782/+subscriptions