yahoo-eng-team team mailing list archive
Message #81399
[Bug 1860555] [NEW] PCI passthrough reschedule race condition
Public bug reported:
Steps to reproduce
------------------
Create multiple instances concurrently using a flavor with a PCI
passthrough request (--property
"pci_passthrough:alias"="<alias>:<count>"), and a scheduler hint with
some anti-affinity constraint.
Expected result
---------------
The instances are created successfully, and each has the expected
number of PCI devices attached.
Actual result
-------------
Sometimes, instances may fail during creation, or may be created with
more PCI devices than requested.
Environment
-----------
Nova 18.2.2 (rocky), CentOS 7, libvirt, deployed by kolla-ansible.
Analysis
--------
If an instance with PCI passthrough devices is rescheduled (e.g. due to
affinity violation), the instance can end up with extra PCI devices attached.
If the devices selected on the original and subsequent compute nodes have the
same address, the instance will fail to create, with the following error:
libvirtError: internal error: Device 0000:89:00.0 is already in use
However, if the devices are different, and all are available on both the first
and second compute nodes, the VM may end up with additional hostdevs.
On investigation, when the instance is rescheduled, the instance object passed to
the conductor RPC API still contains the PCI devices that should have been freed.
This is because the claim object holds a clone of the instance, which is used to
perform the abort on failure [1][2], so the PCI devices removed from the clone's
list are not reflected in the original object. There is a secondary issue: the PCI
manager was not passing the instance through to the PCI object's free() method
in all cases [3], so the PCI device was not removed from the
instance.pci_devices list.
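The stale-clone behaviour described above can be illustrated with a toy model.
This is a minimal sketch, not Nova code: Instance, Claim and claim_on_host are
hypothetical stand-ins that only track PCI addresses, and the second address is
invented for illustration.

```python
import copy


class Instance:
    """Toy stand-in for an instance; it only tracks attached PCI addresses."""

    def __init__(self):
        self.pci_devices = []


class Claim:
    """Toy claim that keeps a clone of the instance, as described above."""

    def __init__(self, instance):
        self.instance = copy.deepcopy(instance)

    def abort(self):
        # Freeing the devices only mutates the clone's list.
        self.instance.pci_devices.clear()


def claim_on_host(instance, host_devices):
    """Attach the host's devices to the instance and return a claim for it."""
    instance.pci_devices.extend(host_devices)
    return Claim(instance)


original = Instance()

# First host: devices are claimed, then the build is aborted
# (for example because of an anti-affinity violation).
first_claim = claim_on_host(original, ["0000:89:00.0"])
first_claim.abort()

# The clone was cleaned up, but the original still lists the device.
stale_devices = list(original.pci_devices)

# Reschedule: the same (stale) instance object is sent to a second host.
claim_on_host(original, ["0000:8a:00.0"])
devices_after_reschedule = list(original.pci_devices)
```

In this model the rescheduled instance carries both addresses, which matches the
"additional hostdevs" symptom. If the second host had picked the same address,
the list would contain it twice, which corresponds to the "already in use" case.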
I have two alternative fixes for this issue, but they will need a little
time to work their way out of an organisation. Essentially:
1. Pass the original instance (not the clone) to the abort function in the Claim.
2. Refresh the instance from the DB when rescheduling.
The former is a more general solution, but I don't know the reasons for
using a clone in the first place. The second works for reschedules, but
may leave a hole for resize or migration. I haven't reproduced the issue
in those cases but it seems possible that it would be present.
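The two options can be sketched in the same toy style. This is an illustration
under stated assumptions, not the proposed patches: SharedClaim,
refresh_pci_devices and the dict used as a database are all hypothetical names,
and real Nova objects behave differently in detail.

```python
class Instance:
    """Toy stand-in for an instance; it only tracks attached PCI addresses."""

    def __init__(self, uuid, pci_devices=()):
        self.uuid = uuid
        self.pci_devices = list(pci_devices)


class SharedClaim:
    """Option 1: the claim holds the caller's instance rather than a clone."""

    def __init__(self, instance):
        self.instance = instance

    def abort(self):
        # The caller's object is updated, so a reschedule sees no stale devices.
        self.instance.pci_devices.clear()


def refresh_pci_devices(instance, db):
    """Option 2: reload the device list from the (toy) database before rescheduling."""
    instance.pci_devices = list(db.get(instance.uuid, []))


# Option 1: abort through a claim that shares the original object.
inst1 = Instance("uuid-1", ["0000:89:00.0"])
SharedClaim(inst1).abort()
option1_devices = inst1.pci_devices

# Option 2: the database already records the devices as freed,
# while the in-memory object is stale until it is refreshed.
db = {"uuid-2": []}
inst2 = Instance("uuid-2", ["0000:89:00.0"])
refresh_pci_devices(inst2, db)
option2_devices = inst2.pci_devices
```

Option 1 fixes the object at the point of abort, so any later caller benefits.
Option 2 only helps code paths that remember to refresh, which is why resize or
migration could still be exposed.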
[1] https://opendev.org/openstack/nova/src/branch/master/nova/compute/claims.py#L64
[2] https://opendev.org/openstack/nova/src/branch/master/nova/compute/claims.py#L83
[3] https://opendev.org/openstack/nova/src/branch/master/nova/pci/manager.py#L309
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1860555
Title:
PCI passthrough reschedule race condition
Status in OpenStack Compute (nova):
New