
yahoo-eng-team team mailing list archive

[Bug 1969496] [NEW] booting with PCI device fails: Attempt to consume PCI device xxx from empty pool

 

Public bug reported:

We saw in the field that the pci_devices table can end up in an inconsistent state after a compute node HW failure and re-deployment. There can be dependent devices where the parent PF is in the available state while its child VFs are in the unavailable state. (Before the HW fault the PF was allocated, hence the VFs were marked unavailable.)
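
To make the inconsistent state concrete, here is a rough sketch of a check over device records. It is only an illustration: the dict keys are assumed to mirror the pci_devices columns (address, dev_type, status, parent_addr), the sample addresses are made up, and find_inconsistent_pfs is a hypothetical helper, not something that exists in nova.

```python
from collections import defaultdict


def find_inconsistent_pfs(devices):
    """Return addresses of available PFs that have an unavailable child VF.

    `devices` is a list of dicts with keys assumed to mirror the
    pci_devices columns: address, dev_type, status, parent_addr.
    """
    # Group the VFs by the address of their parent PF.
    children = defaultdict(list)
    for dev in devices:
        if dev["dev_type"] == "type-VF" and dev.get("parent_addr"):
            children[dev["parent_addr"]].append(dev)

    inconsistent = []
    for dev in devices:
        if dev["dev_type"] != "type-PF" or dev["status"] != "available":
            continue
        vfs = children.get(dev["address"], [])
        if any(vf["status"] == "unavailable" for vf in vfs):
            inconsistent.append(dev["address"])
    return inconsistent


# Made-up sample data: one PF in the bad state, one PF in a sane state.
devices = [
    {"address": "0000:81:00.0", "dev_type": "type-PF",
     "status": "available", "parent_addr": None},
    {"address": "0000:81:00.1", "dev_type": "type-VF",
     "status": "unavailable", "parent_addr": "0000:81:00.0"},
    {"address": "0000:82:00.0", "dev_type": "type-PF",
     "status": "available", "parent_addr": None},
    {"address": "0000:82:00.1", "dev_type": "type-VF",
     "status": "available", "parent_addr": "0000:82:00.0"},
]

print(find_inconsistent_pfs(devices))
```

Only the first PF is reported, since it is available while its child VF is unavailable.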
    
In this state the PF is still schedulable, but during the PCI claim the handling of dependent devices in the PCI tracker fails with the error: "Attempt to consume PCI device XXX from empty pool".
    
The reason for the failure is that when the PF is claimed, all of its child VFs are marked unavailable. If a VF is already unavailable, that step fails.
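
The failure mode can be shown with a toy model. This is a simplified sketch and not nova's PCI tracker code: ToyPool, PoolEmptyError and claim_pf are hypothetical names, and the pool is reduced to a set of free device addresses.

```python
class PoolEmptyError(Exception):
    """Hypothetical stand-in for the error reported in this bug."""


class ToyPool:
    """A set of free device addresses; consuming a missing one fails."""

    def __init__(self, free_addresses):
        self.free = set(free_addresses)

    def consume(self, address):
        if address not in self.free:
            raise PoolEmptyError(
                f"Attempt to consume PCI device {address} from empty pool")
        self.free.remove(address)


def claim_pf(pool, pf_address, vf_addresses):
    """Claim the PF, then strictly consume every child VF."""
    pool.consume(pf_address)
    for vf in vf_addresses:
        pool.consume(vf)


# Inconsistent state: the PF is free, but its VFs are already unavailable,
# so they are not in the pool of free devices.
pool = ToyPool(["0000:81:00.0"])
try:
    claim_pf(pool, "0000:81:00.0", ["0000:81:00.1", "0000:81:00.2"])
    message = None
except PoolEmptyError as exc:
    message = str(exc)

print(message)
```

In this model the PF itself is consumed without a problem; the error comes from the first child VF that is no longer in the pool.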

No reproducer has been found so far that generates the inconsistent
state. (We tried whitelist reconfiguration, evacuation, and VM delete
while the compute was down.) Still, recovering from the inconsistency
should be possible.
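
One possible direction for such a recovery, sketched under assumptions: treat "the VF is already unavailable" as a no-op instead of an error when the parent PF is claimed. The helper below is hypothetical and models the free devices as a plain set; it is not a proposal for the actual nova change.

```python
def mark_children_unavailable(free_addresses, vf_addresses):
    """Remove child VFs from the set of free devices, tolerating gaps.

    Returns the VFs that were already missing, so the caller can log
    them instead of failing the whole claim.
    """
    already_unavailable = []
    for vf in vf_addresses:
        if vf in free_addresses:
            free_addresses.remove(vf)
        else:
            already_unavailable.append(vf)
    return already_unavailable


# Made-up addresses: one VF is still free, the other is already gone.
free = {"0000:81:00.1"}
skipped = mark_children_unavailable(
    free, ["0000:81:00.1", "0000:81:00.2"])
print(skipped)
```

The trade-off in this sketch is that the inconsistency is only tolerated, not explained, so logging the skipped VFs would still be useful for finding the root cause.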

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1969496

Title:
  booting with PCI device fails: Attempt to consume PCI device xxx from
  empty pool

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1969496/+subscriptions


