yahoo-eng-team team mailing list archive
Message #96380
[Bug 2115905] Related fix merged to nova (master)
Reviewed: https://review.opendev.org/c/openstack/nova/+/954613
Committed: https://opendev.org/openstack/nova/commit/f37cdf0c4182103ad81dbf39188ff39955da3850
Submitter: "Zuul (22348)"
Branch: master
commit f37cdf0c4182103ad81dbf39188ff39955da3850
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date: Tue Jul 8 16:55:55 2025 +0200
[PCI tracker]Remove non configured devs when freed
The PCI tracker handles the case when a device spec is removed from
the configuration while a device is still allocated. It keeps the
device until the VM is deleted to avoid inconsistencies.
However, fully removing such a device requires not just the VM deletion
but also a nova-compute restart. The device tracker only frees the
device during VM deletion and does not remove it until the next
nova-compute startup. This allows the device to be re-allocated by
another VM even though the device is not allowed by any device_spec.
This change adds yet another in-memory dict to the PCI tracker to track
the devices that are only kept until they are freed. Then during
free() this dict is consulted, and if the device is in it, the
device is marked for removal as well.
This kills two birds with one stone:
* We prevent the re-allocation of the device as the state of the device
will be set to REMOVED not AVAILABLE during VM deletion.
* As PCI in Placement relies on the state of the device to decide what
to track in placement, this change makes sure that a device that
needs to be removed is now removed from placement too. Note that we have
another bug that prevents this removal for now. But at least the
reproducers of that bug now start to behave the same regardless of
how many devices belong to the same RP in placement.
Related-Bug: #2115905
Change-Id: I63c8fb2669a3c6b3adb77d210c0f9b39d3657c80
Signed-off-by: Balazs Gibizer <gibi@xxxxxxxxxx>
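The mechanism described in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not nova's actual code: the class `ToyPciTracker`, the `Device` dataclass, the `Status` enum, the attribute `remove_when_freed`, and the method `apply_config` are all hypothetical names chosen for illustration. Only the idea (remember allocated devices whose spec is gone, and mark them REMOVED instead of AVAILABLE in free()) comes from the commit message.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    AVAILABLE = "available"
    ALLOCATED = "allocated"
    REMOVED = "removed"


@dataclass
class Device:
    address: str
    status: Status = Status.AVAILABLE
    owner: Optional[str] = None


class ToyPciTracker:
    """Hypothetical stand-in for the PCI tracker; not nova's real class."""

    def __init__(self, devices):
        self.devices = {dev.address: dev for dev in devices}
        # Allocated devices that no longer match any device_spec.
        # They are kept only until they are freed (hypothetical name).
        self.remove_when_freed = {}

    def apply_config(self, allowed_addresses):
        """Reconcile tracked devices with the currently allowed addresses."""
        for address, dev in self.devices.items():
            if address in allowed_addresses:
                # The spec matches again, so the device may stay after free().
                self.remove_when_freed.pop(address, None)
            elif dev.status is Status.ALLOCATED:
                # Still in use: keep it for now, remember to drop it later.
                self.remove_when_freed[address] = dev
            else:
                dev.status = Status.REMOVED

    def allocate(self, address, instance_uuid):
        dev = self.devices[address]
        if dev.status is not Status.AVAILABLE:
            raise ValueError(f"{address} is {dev.status.value}, not available")
        dev.status = Status.ALLOCATED
        dev.owner = instance_uuid

    def free(self, address):
        dev = self.devices[address]
        dev.owner = None
        if address in self.remove_when_freed:
            # Not allowed by configuration any more: never make it
            # AVAILABLE again, mark it for removal right away.
            del self.remove_when_freed[address]
            dev.status = Status.REMOVED
        else:
            dev.status = Status.AVAILABLE
```

In this toy model, a device that is allocated when its spec disappears stays ALLOCATED after `apply_config()`, becomes REMOVED on `free()`, and a later `allocate()` for it fails, without needing a restart in between.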
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2115905
Title:
Cannot delete a VM that uses a PCI device where the matching
device_spec is removed if PCI in Placement is enabled
Status in OpenStack Compute (nova):
Fix Released
Bug description:
If a device_spec is removed while the device matching it is in use,
nova-compute will log a warning at startup. Our doc says:
https://docs.openstack.org/nova/latest/admin/pci-passthrough.html#pci-tracking-in-placement
Reconfiguring the PCI devices on the hypervisor or changing the
pci.device_spec configuration option and restarting the nova-compute
service is supported in the following cases:
* new devices are added
* devices without allocation are removed
Removing a device that has allocations is not supported. If a device
having any allocation is removed then the nova-compute service will
keep the device and the allocation exists in the nova DB and in
placement and logs a warning. If a device with any allocation is
reconfigured in a way that an allocated PF is removed and VFs from the
same PF is configured (or vice versa) then nova-compute will refuse to
start as it would create a situation where both the PF and its VFs are
made available for consumption.
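The documented rules above can be summarized as a small decision function. This is a minimal sketch, assuming devices are identified by plain address strings and that a VF-to-PF mapping is known; the function name `plan_reconfiguration` and all of its parameters are hypothetical and do not correspond to any nova API.

```python
def plan_reconfiguration(old, new, allocated, parent_pf):
    """Classify a device_spec change according to the documented rules.

    old, new   -- sets of device addresses matched before / after the change
    allocated  -- set of device addresses that currently have an allocation
    parent_pf  -- dict mapping a VF address to the address of its PF
    (All names here are illustrative assumptions.)
    """
    added = new - old
    dropped = old - new
    # Removing a device with allocations is not supported: it is kept
    # and a warning is logged.
    kept = dropped & allocated
    removed = dropped - allocated

    for dev in kept:
        pf = parent_pf.get(dev)
        # An allocated VF is kept while its PF becomes configured.
        if pf is not None and pf in new:
            raise RuntimeError(f"VF {dev} is kept but its PF {pf} is configured")
        # An allocated PF is kept while one of its VFs becomes configured.
        if any(parent_pf.get(vf) == dev for vf in new):
            raise RuntimeError(f"PF {dev} is kept but its VFs are configured")

    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "kept_with_warning": sorted(kept),
    }
```

The raised error models the case where nova-compute refuses to start, since keeping the allocated device would expose both a PF and its VFs at the same time.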
The actual warning says:
Unable to remove device with status 'allocated' and ownership
818f2460-61ff-449e-b3c4-9e3626e01645 because of PCI device
1:0000:07:10.2 is allocated instead of ['available', 'unavailable',
'unclaimable']. Check your [pci]device_spec configuration to make sure
this allocated device is whitelisted. If you have removed the device
from the whitelist intentionally or the device is no longer available
on the host you will need to delete the server or migrate it to
another host to silence this warning.:
nova.exception.PciDeviceInvalidStatus: PCI device 1:0000:07:10.2 is
allocated instead of ['available', 'unavailable', 'unclaimable']
But the suggestion to delete the server (and probably the other suggestion, to migrate it) is wrong. Trying to delete the server in this state causes the VM to go to ERROR state, and it cannot be deleted.
The only way to delete this VM then is to manually delete the
placement allocation of the VM first, then delete the VM. This is
pretty dangerous, so it should not be suggested.
The clean way to avoid this is not to remove the device_spec while the
device is in use. Or, if it has been removed, put it back, delete the VM,
and then remove the device_spec. However, if this problem is triggered not
by a manual reconfiguration of the device_spec but by a device disappearing
from the hypervisor, putting it back to delete the VM is not an
option.
The full reproduction steps and stack traces are in:
https://paste.opendev.org/show/bn0uXg4JqOLPUKg4AKd5/
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2115905/+subscriptions