[Bug 1780441] [NEW] Rebuild does not respect number of PCIe devices
Public bug reported:
Description
===========
When rebuilding an instance with a GPU attached, it may end up with additional GPUs if free ones are available. The number can vary between rebuilds; most often the instance receives the same number of GPUs as it had before the latest rebuild.
Steps to reproduce
==================
$ openstack flavor show 5fd13401-7daa-464d-acf1-432d29a3dd92
+----------------------------+-----------------------------------------------+
| Field | Value |
+----------------------------+-----------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| disk | 80 |
| id | 5fd13401-7daa-464d-acf1-432d29a3dd92 |
| name | gpu.2.1gpu |
| os-flavor-access:is_public | True |
| properties | gpu_m10='true', pci_passthrough:alias='M10:1' |
| ram | 5000 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 2 |
+----------------------------+-----------------------------------------------+
$ openstack server create my-gpu-instace --image CentOS-7 --network my-
project-network --flavor 5fd13401-7daa-464d-acf1-432d29a3dd92 --key-name
my-key --security-group default
On the gpu node:
[root@g1 ~]# virsh dumpxml instance-0001b22e |grep vfio
<driver name='vfio'/>
$ openstack server rebuild 29d5a9ba-0829-4e33-9d1c-4ee66b55a940
On the gpu node:
[root@g1 ~]# virsh dumpxml instance-0001b22e |grep vfio
<driver name='vfio'/>
<driver name='vfio'/>
<driver name='vfio'/>
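The `grep vfio` check above can be made more robust by parsing the domain XML rather than counting lines. A minimal sketch, assuming a hypothetical XML snippet standing in for the real `virsh dumpxml instance-0001b22e` output after the first rebuild (three hostdev entries, mirroring the three lines above):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real dumpxml output; the three hostdev
# entries mirror the three <driver name='vfio'/> lines seen above.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci'><driver name='vfio'/></hostdev>
    <hostdev mode='subsystem' type='pci'><driver name='vfio'/></hostdev>
    <hostdev mode='subsystem' type='pci'><driver name='vfio'/></hostdev>
  </devices>
</domain>
"""

def count_vfio_hostdevs(xml_text):
    """Count PCI hostdev entries whose driver is vfio."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for hostdev in root.findall("./devices/hostdev")
        if hostdev.find("driver") is not None
        and hostdev.find("driver").get("name") == "vfio"
    )

print(count_vfio_hostdevs(DOMAIN_XML))  # 3 here, although the flavor asked for 1
```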
* The database:
MariaDB [nova]> select * from pci_devices where instance_uuid='29d5a9ba-0829-4e33-9d1c-4ee66b55a940';
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| created_at | updated_at | deleted_at | deleted | id | compute_node_id | address | product_id | vendor_id | dev_type | dev_id | label | status | extra_info | instance_uuid | request_id | numa_node | parent_addr |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| 2018-06-27 10:54:44 | 2018-07-04 12:12:57 | NULL | 0 | 6 | 36 | 0000:3e:00.0 | 13bd | 10de | type-PCI | pci_0000_3e_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 0 | NULL |
| 2018-06-27 10:54:44 | 2018-07-06 11:00:21 | NULL | 0 | 9 | 36 | 0000:3f:00.0 | 13bd | 10de | type-PCI | pci_0000_3f_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 0 | NULL |
| 2018-06-27 10:54:44 | 2018-07-04 12:16:31 | NULL | 0 | 12 | 36 | 0000:40:00.0 | 13bd | 10de | type-PCI | pci_0000_40_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 0 | NULL |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
3 rows in set (0.01 sec)
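The same consistency check can be expressed as a query: any instance holding more allocated rows than its flavor's alias requested is in the buggy state. A sketch using an in-memory SQLite stand-in for the nova `pci_devices` table, seeded with the three rows shown above (columns reduced to the ones that matter here):

```python
import sqlite3

# Minimal stand-in for nova's pci_devices table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pci_devices (
        address TEXT, status TEXT, deleted INTEGER, instance_uuid TEXT
    )
""")
uuid = "29d5a9ba-0829-4e33-9d1c-4ee66b55a940"
conn.executemany(
    "INSERT INTO pci_devices VALUES (?, 'allocated', 0, ?)",
    [("0000:3e:00.0", uuid), ("0000:3f:00.0", uuid), ("0000:40:00.0", uuid)],
)

# The flavor's alias 'M10:1' requests exactly one device, so more than
# one allocated row per instance indicates the leak described above.
allocated = conn.execute(
    "SELECT COUNT(*) FROM pci_devices "
    "WHERE status = 'allocated' AND deleted = 0 AND instance_uuid = ?",
    (uuid,),
).fetchone()[0]
print(allocated)  # 3 allocated devices for a flavor that requested 1
```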
* After some additional rebuilds (5-10), there are 4 GPUs in the
database but only one is visible from virsh
MariaDB [nova]> select * from pci_devices where instance_uuid='29d5a9ba-0829-4e33-9d1c-4ee66b55a940';
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| created_at | updated_at | deleted_at | deleted | id | compute_node_id | address | product_id | vendor_id | dev_type | dev_id | label | status | extra_info | instance_uuid | request_id | numa_node | parent_addr |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| 2018-06-27 10:54:44 | 2018-07-04 12:12:57 | NULL | 0 | 6 | 36 | 0000:3e:00.0 | 13bd | 10de | type-PCI | pci_0000_3e_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 0 | NULL |
| 2018-06-27 10:54:44 | 2018-07-06 11:00:21 | NULL | 0 | 9 | 36 | 0000:3f:00.0 | 13bd | 10de | type-PCI | pci_0000_3f_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 0 | NULL |
| 2018-06-27 10:54:44 | 2018-07-04 12:16:31 | NULL | 0 | 12 | 36 | 0000:40:00.0 | 13bd | 10de | type-PCI | pci_0000_40_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 0 | NULL |
| 2018-06-27 10:54:44 | 2018-07-06 12:25:19 | NULL | 0 | 21 | 36 | 0000:dc:00.0 | 13bd | 10de | type-PCI | pci_0000_dc_00_0 | label_10de_13bd | allocated | {} | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL | 1 | NULL |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
4 rows in set (0.00 sec)
[root@g1 ~]# virsh dumpxml instance-0001b22e |grep "vfio\|uuid>"
<uuid>29d5a9ba-0829-4e33-9d1c-4ee66b55a940</uuid>
<driver name='vfio'/>
Expected result
===============
The instance comes back with only one GPGPU after every rebuild, as requested by the flavor.
Actual result
=============
The instance gets rebuilt with an unexpected number of GPGPUs, most often the same number it had before the last rebuild. I have observed 1-3 GPGPUs. This has been tested on a system with 3 NVIDIA Tesla V100s, one with 4 NVIDIA Tesla P100s, and a system with two physical NVIDIA M10s (which the system sees as 8 GPGPUs, 4 per card).
Environment
===========
[root@g1 ~]# rpm -qa |grep nova
openstack-nova-common-14.1.0-1.el7.noarch
openstack-nova-compute-14.1.0-1.el7.noarch
python2-novaclient-6.0.2-1.el7.noarch
python-nova-14.1.0-1.el7.noarch
[root@g1 ~]# rpm -qa |grep -i 'kvm\|qemu\|libvirt' |grep -v daemon
libvirt-client-3.9.0-14.el7_5.5.x86_64
qemu-kvm-ev-2.10.0-21.el7_5.3.1.x86_64
libvirt-python-3.9.0-1.el7.x86_64
qemu-img-ev-2.10.0-21.el7_5.3.1.x86_64
qemu-kvm-common-ev-2.10.0-21.el7_5.3.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-libs-3.9.0-14.el7_5.5.x86_64
[root@g1 ~]# rbd -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
[root@g1 ~]# rpm -qa openstack-neutron*
openstack-neutron-common-9.4.1-1.el7.noarch
openstack-neutron-9.4.1-1.el7.noarch
openstack-neutron-linuxbridge-9.4.1-1.el7.noarch
openstack-neutron-ml2-9.4.1-1.el7.noarch
Logs & Configs
==============
I don't know what config/log files would be most useful and I won't put
a dump online, but I'm sure that I can grep for stuff if necessary.
[root@devel1 ~]# grep ^pci_alias /etc/nova/nova.conf
pci_alias={"vendor_id":"10de","product_id":"13bd","name":"M10"}
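For reference, the flavor's `pci_passthrough:alias='M10:1'` extra spec is tied to this nova.conf alias by name, and its `:1` suffix requests exactly one device. A small sketch parsing both (values copied verbatim from the outputs above):

```python
import json

# The pci_alias line from nova.conf above.
pci_alias = json.loads('{"vendor_id":"10de","product_id":"13bd","name":"M10"}')

# The flavor extra spec from the `openstack flavor show` output above.
flavor_alias_spec = "M10:1"

# Alias specs take the form "<alias name>:<device count>".
name, count = flavor_alias_spec.rsplit(":", 1)
assert name == pci_alias["name"]
print(count)  # '1' -- every rebuild should leave the instance with one GPU
```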
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1780441