
yahoo-eng-team team mailing list archive

[Bug 2115905] [NEW] Cannot delete a VM that uses a PCI device where the matching device_spec is removed

 

Public bug reported:

If a device_spec is removed while the device matching it is in use,
nova-compute logs a warning at startup. Our documentation says:

https://docs.openstack.org/nova/latest/admin/pci-passthrough.html#pci-tracking-in-placement

  Reconfiguring the PCI devices on the hypervisor or changing the
  pci.device_spec configuration option and restarting the nova-compute
  service is supported in the following cases:

  * new devices are added

  * devices without allocation are removed

  Removing a device that has allocations is not supported. If a device
  having any allocation is removed then the nova-compute service will keep
  the device and the allocation exists in the nova DB and in placement and
  logs a warning. If a device with any allocation is reconfigured in a way
  that an allocated PF is removed and VFs from the same PF is configured
  (or vice versa) then nova-compute will refuse to start as it would
  create a situation where both the PF and its VFs are made available for
  consumption.

The actual warning says:

  Unable to remove device with status 'allocated' and ownership
  818f2460-61ff-449e-b3c4-9e3626e01645 because of PCI device
  1:0000:07:10.2 is allocated instead of ['available', 'unavailable',
  'unclaimable']. Check your [pci]device_spec configuration to make sure
  this allocated device is whitelisted. If you have removed the device
  from the whitelist intentionally or the device is no longer available on
  the host you will need to delete the server or migrate it to another
  host to silence this warning.: nova.exception.PciDeviceInvalidStatus:
  PCI device 1:0000:07:10.2 is allocated instead of ['available',
  'unavailable', 'unclaimable']


But the suggestion to delete the server (and probably the other
suggestion, to migrate it) is wrong. Trying to delete the server in this
state puts the VM into ERROR state, and the VM cannot be deleted.

The only way to delete this VM then is to manually delete the placement
allocation of the VM first, then delete the VM. This is pretty
dangerous, so it should not be suggested.

The clean way to avoid this is not to remove the device_spec while the
device is in use. If it has already been removed, put it back, delete the
VM, and only then remove the device_spec. However, if the problem is
triggered not by a manual reconfiguration of the device_spec but by a
device disappearing from the hypervisor, putting the device_spec back in
order to delete the VM is not an option.
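
As an illustration of the "do not remove the device_spec while the
device is in use" rule, here is a minimal sketch of a pre-check an
operator could run before editing the configuration. It is not nova
code. The function names, the dict layout of devices and specs, and the
vendor/product IDs are all hypothetical, and the matching is far simpler
than nova's real device_spec matching (the address is treated as a plain
glob).

```python
from fnmatch import fnmatchcase


def spec_matches(spec, dev):
    """Return True if a (simplified, hypothetical) spec matches a device."""
    for key in ("vendor_id", "product_id"):
        # A key missing from the spec is treated as "match anything".
        if key in spec and spec[key] != dev.get(key):
            return False
    # Assumption: the address is a glob such as "0000:07:10.*".
    return fnmatchcase(dev["address"], spec.get("address", "*"))


def orphaned_allocated_devices(new_specs, devices):
    """List allocated devices that none of the new specs would match."""
    return [
        dev
        for dev in devices
        if dev["status"] == "allocated"
        and not any(spec_matches(spec, dev) for spec in new_specs)
    ]


# Example data; the IDs are made up for illustration only.
devices = [
    {"address": "0000:07:10.2", "vendor_id": "8086",
     "product_id": "10ca", "status": "allocated"},
    {"address": "0000:07:10.4", "vendor_id": "8086",
     "product_id": "10ca", "status": "available"},
]
current_specs = [
    {"vendor_id": "8086", "product_id": "10ca", "address": "0000:07:10.*"},
]

# Keeping the spec leaves nothing orphaned; dropping it orphans the
# allocated device, which is the situation described in this bug.
keep = orphaned_allocated_devices(current_specs, devices)
drop = orphaned_allocated_devices([], devices)
```

If the second call returns any device, the VM using it should be deleted
or moved before the device_spec is changed. Such a check cannot help
when the device itself disappears from the hypervisor.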


The full reproduction steps and stack traces are in:
https://paste.opendev.org/show/bn0uXg4JqOLPUKg4AKd5/

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: pci placement


-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2115905


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2115905/+subscriptions