yahoo-eng-team team mailing list archive
Message #96514
[Bug 2125782] [NEW] PCI slots exhaustion is opaque to the API user
Public bug reported:
Recently, with the proliferation of Kubernetes and AI-specific
workloads, we have started seeing more frequent cases where users or
automation attempt to hot-plug more PCI(e) devices (ports, volumes) than
a qemu/libvirt virtual machine of a given machine type can handle.
For example, the q35 machine type family has a variable number of free
PCIe slots pre-created on boot by Nova (configurable, and currently
limited to 28). For i440fx machines, the total number of PCI devices is
limited to 32.
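As a rough illustration of these limits, the sketch below estimates how
many more devices could still be hot-plugged. The function and constant
names are hypothetical, the numbers are the ones quoted in this report
(not read from Nova or libvirt), and the model is simplified: it does
not account for slots already taken by built-in devices.

```python
# Limits as quoted in this bug report; assumptions, not values read from Nova.
Q35_MAX_PCIE_PORTS = 28       # upper bound on pre-created free PCIe slots
I440FX_MAX_PCI_DEVICES = 32   # total PCI devices on an i440fx machine


def remaining_hotplug_slots(machine_type, attached, configured_ports=None):
    """Estimate how many more PCI(e) devices can be hot-plugged.

    machine_type: "q35" or "i440fx" (simplified, hypothetical naming).
    attached: number of devices already counted against the limit.
    configured_ports: for q35, the number of slots pre-created at boot;
        defaults to the maximum quoted in the report.
    """
    if machine_type == "q35":
        ports = Q35_MAX_PCIE_PORTS if configured_ports is None else configured_ports
        # The configured value cannot exceed the stated upper bound.
        limit = min(ports, Q35_MAX_PCIE_PORTS)
    elif machine_type == "i440fx":
        limit = I440FX_MAX_PCI_DEVICES
    else:
        raise ValueError(f"unknown machine type: {machine_type}")
    # Never report a negative number of free slots.
    return max(limit - attached, 0)
```

When the result is 0, a further attach request cannot succeed until a
device is unplugged or, on q35, more slots are configured and the
instance is hard-rebooted.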
The problem is that the error the user receives in this case is a
standard HTTP 500 error with no information on the cause, so the user
may retry the request even though it definitely won't succeed unless
some other device is unplugged from the instance, or nova is
reconfigured and the instance is hard-rebooted to acquire more PCIe
slots.
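To illustrate why a bare 500 is unhelpful to automation, here is a
minimal sketch of a naive client-side retry policy. The function name
and the policy itself are assumptions made for illustration, not code
from any OpenStack client.

```python
def should_retry(status_code):
    """Naive retry policy (hypothetical) based only on the HTTP status."""
    # 5xx is commonly read as a transient server-side failure, so the
    # client retries, even if the real cause is slot exhaustion.
    if 500 <= status_code < 600:
        return True
    # 4xx tells the client the request must change before it can succeed.
    return False
```

With such a policy, a slot-exhaustion failure reported as 500 is retried
for no benefit, because the status code alone does not distinguish it
from a transient fault.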
IMO in this particular case, it would be beneficial to share the error
with the user, for example by returning HTTP 409 Conflict (or at least
400 Bad Request) with a clear message indicating what the problem was,
so that the user (or some automation) can make an educated decision on
how to proceed.
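A minimal sketch of what such a translation could look like follows.
The exception class, the helper function, and the response body layout
are all hypothetical; this is not how Nova currently handles the error,
only one possible shape for the proposal.

```python
class PCISlotsExhausted(Exception):
    """Hypothetical exception; not an existing Nova class."""

    def __init__(self, instance_uuid, machine_type):
        self.instance_uuid = instance_uuid
        self.machine_type = machine_type
        super().__init__(
            f"Instance {instance_uuid} ({machine_type}) has no free PCI(e) "
            "slots; detach a device or hard-reboot with more slots configured."
        )


def to_http_response(exc):
    """Map an internal error to an (HTTP status, body) pair."""
    if isinstance(exc, PCISlotsExhausted):
        # 409 Conflict: the request conflicts with the instance's current
        # state and will not succeed if simply retried.
        return 409, {"code": 409, "message": str(exc)}
    # Anything else stays an opaque server error.
    return 500, {"code": 500, "message": "Unexpected API error."}
```

A caller receiving the 409 and its message can then detach a device or
request more slots instead of retrying blindly.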
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2125782
Title:
PCI slots exhaustion is opaque to the API user
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2125782/+subscriptions