[Bug 1633120] Re: Nova scheduler tries to assign an already-in-use SRIOV QAT VF to a new instance
Reviewed: https://review.openstack.org/626381
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=26c41eccade6412f61f9a8721d853b545061adcc
Submitter: Zuul
Branch: master
commit 26c41eccade6412f61f9a8721d853b545061adcc
Author: Sean Mooney <work@xxxxxxxxxxxxxxx>
Date: Wed Dec 19 19:40:05 2018 +0000
PCI: do not force remove allocated devices
In the Ocata release the pci_passthrough_whitelist option
was moved from the [DEFAULT] section of nova.conf
to the [pci] section and renamed to passthrough_whitelist.
When upgrading, if the operator chooses to migrate the config
value to the new section, it is not uncommon
to forget to rename the option.
Similarly, if an operator is updating the whitelist and
mistypes the value, that can also lead to the whitelist
being ignored.
As a result of either error, the nova compute agent
would delete all database entries for a host, regardless of
whether the PCI device was in use by an instance. If this occurs,
the only recourse for an operator is to delete and recreate
the guests on that host after correcting the error, or to manually
restore the database from a backup or another consistent state.
This change alters the _set_hvdevs function so that it does not
force-remove allocated or claimed devices when they are no longer
present in the PCI whitelist.
Closes-Bug: #1633120
Change-Id: I6e871311a0fa10beaf601ca6912b4a33ba4094e0
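For reference, the configuration rename described in the commit message looks like this in nova.conf. The device-spec value shown is only an example; the vendor/product IDs (8086:0443) match the Intel QAT VFs that appear in the lspci output later in this report.

# Old location and name (pre-Ocata):
[DEFAULT]
pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "0443"}

# New location and name (Ocata and later):
[pci]
passthrough_whitelist = {"vendor_id": "8086", "product_id": "0443"}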
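The behaviour change in _set_hvdevs can be illustrated with a minimal sketch. This is not the actual Nova code: the PciDevice class, status constants, and function below are simplified stand-ins for the real PCI tracker, and only model the rule the commit describes, namely that allocated or claimed devices are kept (with a warning) instead of being force-removed when they disappear from the whitelist.

# Simplified illustration only -- not the real nova.pci.manager implementation.
ALLOCATED = 'allocated'
CLAIMED = 'claimed'
AVAILABLE = 'available'
REMOVED = 'removed'

class PciDevice:
    def __init__(self, address, status=AVAILABLE):
        self.address = address
        self.status = status

def set_hvdevs(tracked_devices, whitelisted_addresses):
    """Reconcile tracked PCI devices against the configured whitelist.

    tracked_devices: PciDevice objects currently recorded for the host.
    whitelisted_addresses: set of PCI addresses parsed from
    [pci]/passthrough_whitelist.
    """
    for dev in tracked_devices:
        if dev.address in whitelisted_addresses:
            continue
        if dev.status in (ALLOCATED, CLAIMED):
            # In use by an instance: keep the record so the guest and the
            # database stay consistent; the operator fixes the whitelist.
            print('Unable to remove device %s: still %s' % (dev.address, dev.status))
        else:
            dev.status = REMOVED

# Example: a mistyped (here, empty) whitelist no longer wipes an allocated VF.
devices = [PciDevice('0000:88:04.7', ALLOCATED), PciDevice('0000:88:04.6')]
set_hvdevs(devices, whitelisted_addresses=set())
print([(d.address, d.status) for d in devices])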
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1633120
Title:
Nova scheduler tries to assign an already-in-use SRIOV QAT VF to a new
instance
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) ocata series:
Triaged
Status in OpenStack Compute (nova) pike series:
Triaged
Status in OpenStack Compute (nova) queens series:
Triaged
Status in OpenStack Compute (nova) rocky series:
Triaged
Bug description:
When trying to create a VM instance (say A) with one QAT VF, it fails with the following error: "Requested operation is not valid: PCI device 0000:88:04.7 is in use by driver QEMU, domain instance-00000081". Note that PCI device 0000:88:04.7 is already assigned to another VM (say B). We have installed the OpenStack Mitaka release on a CentOS 7 system. The system has two Intel QAT devices, with 32 VFs available per QAT/DH895xCC device. Out of the 64 VFs, only 8 are allocated to VM instances and the rest should be available.
But the nova scheduler tries to assign an already-in-use SR-IOV VF to a new instance, and the instance fails to start. It appears that the nova database is not tracking which VFs have already been taken. However, if I shut down VM B, then VM A boots up, and vice versa. Note that the two VM instances cannot run simultaneously because of this issue.
We should always be able to create as many instances with the
requested PCI devices as there are available VFs.
Please feel free to let me know if additional information is needed.
Can anyone please suggest why it tries to assign the same PCI device
that has already been assigned? Is there any way to resolve this issue?
Thank you in advance for your support and help.
[root@localhost ~(keystone_admin)]# lspci -d:435
83:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
88:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
[root@localhost ~(keystone_admin)]#
[root@localhost ~(keystone_admin)]# lspci -d:443 | grep "QAT Virtual Function" | wc -l
64
[root@localhost ~(keystone_admin)]#
[root@localhost ~(keystone_admin)]# mysql -u root nova -e "SELECT hypervisor_hostname, address, instance_uuid, status FROM pci_devices JOIN compute_nodes on compute_nodes.id=compute_node_id" | grep 0000:88:04.7
localhost 0000:88:04.7 e10a76f3-e58e-4071-a4dd-7a545e8000de allocated
localhost 0000:88:04.7 c3dbac90-198d-4150-ba0f-a80b912d8021 allocated
localhost 0000:88:04.7 c7f6adad-83f0-4881-b68f-6d154d565ce3 allocated
localhost.nfv.benunets.com 0000:88:04.7 0c3c11a5-f9a4-4f0d-b120-40e4dde843d4 allocated
[root@localhost ~(keystone_admin)]#
[root@localhost ~(keystone_admin)]# grep -r e10a76f3-e58e-4071-a4dd-7a545e8000de /etc/libvirt/qemu
/etc/libvirt/qemu/instance-00000081.xml: <uuid>e10a76f3-e58e-4071-a4dd-7a545e8000de</uuid>
/etc/libvirt/qemu/instance-00000081.xml: <entry name='uuid'>e10a76f3-e58e-4071-a4dd-7a545e8000de</entry>
/etc/libvirt/qemu/instance-00000081.xml: <source file='/var/lib/nova/instances/e10a76f3-e58e-4071-a4dd-7a545e8000de/disk'/>
/etc/libvirt/qemu/instance-00000081.xml: <source path='/var/lib/nova/instances/e10a76f3-e58e-4071-a4dd-7a545e8000de/console.log'/>
/etc/libvirt/qemu/instance-00000081.xml: <source path='/var/lib/nova/instances/e10a76f3-e58e-4071-a4dd-7a545e8000de/console.log'/>
[root@localhost ~(keystone_admin)]#
[root@localhost ~(keystone_admin)]# grep -r 0c3c11a5-f9a4-4f0d-b120-40e4dde843d4 /etc/libvirt/qemu
/etc/libvirt/qemu/instance-000000ab.xml: <uuid>0c3c11a5-f9a4-4f0d-b120-40e4dde843d4</uuid>
/etc/libvirt/qemu/instance-000000ab.xml: <entry name='uuid'>0c3c11a5-f9a4-4f0d-b120-40e4dde843d4</entry>
/etc/libvirt/qemu/instance-000000ab.xml: <source file='/var/lib/nova/instances/0c3c11a5-f9a4-4f0d-b120-40e4dde843d4/disk'/>
/etc/libvirt/qemu/instance-000000ab.xml: <source path='/var/lib/nova/instances/0c3c11a5-f9a4-4f0d-b120-40e4dde843d4/console.log'/>
/etc/libvirt/qemu/instance-000000ab.xml: <source path='/var/lib/nova/instances/0c3c11a5-f9a4-4f0d-b120-40e4dde843d4/console.log'/>
[root@localhost ~(keystone_admin)]#
On the controller, it appears there are duplicate PCI device entries in the database:
MariaDB [nova]> select hypervisor_hostname,address,count(*) from pci_devices JOIN compute_nodes on compute_nodes.id=compute_node_id group by hypervisor_hostname,address having count(*) > 1;
+---------------------+--------------+----------+
| hypervisor_hostname | address | count(*) |
+---------------------+--------------+----------+
| localhost | 0000:05:00.0 | 3 |
| localhost | 0000:05:00.1 | 3 |
| localhost | 0000:83:01.0 | 3 |
| localhost | 0000:83:01.1 | 3 |
| localhost | 0000:83:01.2 | 3 |
| localhost | 0000:83:01.3 | 3 |
| localhost | 0000:83:01.4 | 3 |
| localhost | 0000:83:01.5 | 3 |
| localhost | 0000:83:01.6 | 3 |
| localhost | 0000:83:01.7 | 3 |
| localhost | 0000:83:02.0 | 3 |
| localhost | 0000:83:02.1 | 3 |
| localhost | 0000:83:02.2 | 3 |
| localhost | 0000:83:02.3 | 3 |
| localhost | 0000:83:02.4 | 3 |
| localhost | 0000:83:02.5 | 3 |
| localhost | 0000:83:02.6 | 3 |
| localhost | 0000:83:02.7 | 3 |
| localhost | 0000:83:03.0 | 3 |
| localhost | 0000:83:03.1 | 3 |
| localhost | 0000:83:03.2 | 3 |
| localhost | 0000:83:03.3 | 3 |
| localhost | 0000:83:03.4 | 3 |
| localhost | 0000:83:03.5 | 3 |
| localhost | 0000:83:03.6 | 3 |
| localhost | 0000:83:03.7 | 3 |
| localhost | 0000:83:04.0 | 3 |
| localhost | 0000:83:04.1 | 3 |
| localhost | 0000:83:04.2 | 3 |
| localhost | 0000:83:04.3 | 3 |
| localhost | 0000:83:04.4 | 3 |
| localhost | 0000:83:04.5 | 3 |
| localhost | 0000:83:04.6 | 3 |
| localhost | 0000:83:04.7 | 3 |
| localhost | 0000:88:01.0 | 3 |
| localhost | 0000:88:01.1 | 3 |
| localhost | 0000:88:01.2 | 3 |
| localhost | 0000:88:01.3 | 3 |
| localhost | 0000:88:01.4 | 3 |
| localhost | 0000:88:01.5 | 3 |
| localhost | 0000:88:01.6 | 3 |
| localhost | 0000:88:01.7 | 3 |
| localhost | 0000:88:02.0 | 3 |
| localhost | 0000:88:02.1 | 3 |
| localhost | 0000:88:02.2 | 3 |
| localhost | 0000:88:02.3 | 3 |
| localhost | 0000:88:02.4 | 3 |
| localhost | 0000:88:02.5 | 3 |
| localhost | 0000:88:02.6 | 3 |
| localhost | 0000:88:02.7 | 3 |
| localhost | 0000:88:03.0 | 3 |
| localhost | 0000:88:03.1 | 3 |
| localhost | 0000:88:03.2 | 3 |
| localhost | 0000:88:03.3 | 3 |
| localhost | 0000:88:03.4 | 3 |
| localhost | 0000:88:03.5 | 3 |
| localhost | 0000:88:03.6 | 3 |
| localhost | 0000:88:03.7 | 3 |
| localhost | 0000:88:04.0 | 3 |
| localhost | 0000:88:04.1 | 3 |
| localhost | 0000:88:04.2 | 3 |
| localhost | 0000:88:04.3 | 3 |
| localhost | 0000:88:04.4 | 3 |
| localhost | 0000:88:04.5 | 3 |
| localhost | 0000:88:04.6 | 3 |
| localhost | 0000:88:04.7 | 3 |
+---------------------+--------------+----------+
66 rows in set (0.00 sec)
MariaDB [nova]>
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1633120/+subscriptions