yahoo-eng-team team mailing list archive
Message #82452
[Bug 1780225] Re: Libvirt error when using --max > 1 with vGPU
In Stein, we merged the ability to have multiple Resource Providers, each of them being a pGPU.
In Ussuri, we made it possible to have a specific vGPU type per pGPU.
I have now tested the behaviour described above with https://review.opendev.org/723858
and it works, unless you ask for a specific total capacity.
I'm closing this bug, which was only about libvirt vGPUs; please see
https://bugs.launchpad.net/nova/+bug/1874664 for the related issue.
** Changed in: nova
Status: Confirmed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1780225
Title:
Libvirt error when using --max > 1 with vGPU
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Description
===========
Using devstack Rocky with an NVIDIA Tesla M10 + GRID driver on RHEL 7.5.
Profile used in nova: nvidia-35 (num_heads=2, frl_config=45, framebuffer=512M, max_resolution=2560x1600, max_instance=16)
I can launch instances one by one without any issue.
I cannot use a --max parameter greater than 1.
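Since single launches succeed, one possible stopgap (not a fix) is to issue one create request per instance and wait for each build to finish. A minimal Python sketch, assuming the flavor, image and key names used in this report; the helper name `sequential_create_commands` and the `instance-N` naming are hypothetical, and `--wait` is the client option that blocks until the build completes:

```python
import shlex


def sequential_create_commands(count, flavor="vgpu", image="rhel75",
                               key_name="myself", base_name="instance"):
    """Build one `openstack server create` argv per instance instead of using --max."""
    commands = []
    for index in range(1, count + 1):
        commands.append([
            "openstack", "server", "create",
            "--flavor", flavor,
            "--image", image,
            "--key-name", key_name,
            "--wait",  # block until this build finishes before starting the next
            f"{base_name}-{index}",
        ])
    return commands


# Print the commands rather than running them; each argv could instead be
# passed to subprocess.run(argv, check=True) on a host with the client installed.
for argv in sequential_create_commands(2):
    print(shlex.join(argv))
```

This only serialises the requests; whether it avoids the failure in every deployment is an assumption based on the behaviour described above.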
Expected result
===============
Be able to use --max parameter with vGPU
Steps to reproduce
==================
[root@host2 ~]# openstack server list
+--------------------------------------+-----------+--------+---------------------------------------------------------------------+--------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-----------+--------+---------------------------------------------------------------------+--------+--------+
| 56aeda96-f193-49fc-914d-8b507674eb16 | instance0 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fef2:8e20, 10.0.0.12, 172.24.4.2 | rhel75 | vgpu |
+--------------------------------------+-----------+--------+---------------------------------------------------------------------+--------+--------+
[root@host2 ~]# openstack server create --flavor vgpu --image rhel75 --key-name myself --max 2 instance
+-------------------------------------+-----------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | iNiFmD6kNszw |
| config_drive | |
| created | 2018-07-05T09:19:25Z |
| flavor | vgpu (vgpu1) |
| hostId | |
| id | 5a8691a8-a18c-4c71-8541-be00f224fd82 |
| image | rhel75 (e63a49a8-4568-4b57-9d12-1eb1ede28438) |
| key_name | myself |
| name | instance-1 |
| progress | 0 |
| project_id | fdea2c781db74ae593c5e9501e9290cc |
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2018-07-05T09:19:25Z |
| user_id | 130a646fc362418f8b62ac11f1154942 |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------+
[root@host2 ~]# openstack server list
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| 515f0d21-6ab8-406e-9889-177718c79e61 | instance-2 | ERROR | | rhel75 | vgpu |
| 5a8691a8-a18c-4c71-8541-be00f224fd82 | instance-1 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fe1f:d7a, 10.0.0.11 | rhel75 | vgpu |
| 56aeda96-f193-49fc-914d-8b507674eb16 | instance0 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fef2:8e20, 10.0.0.12, 172.24.4.2 | rhel75 | vgpu |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
[root@host2 ~]# openstack server create --flavor vgpu --image rhel75 --key-name myself --max 1 instance
+-------------------------------------+-----------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | MGxmntECb22S |
| config_drive | |
| created | 2018-07-05T09:19:45Z |
| flavor | vgpu (vgpu1) |
| hostId | |
| id | 24df940f-500b-44db-88e2-a6fd1fe915c0 |
| image | rhel75 (e63a49a8-4568-4b57-9d12-1eb1ede28438) |
| key_name | myself |
| name | instance |
| progress | 0 |
| project_id | fdea2c781db74ae593c5e9501e9290cc |
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2018-07-05T09:19:45Z |
| user_id | 130a646fc362418f8b62ac11f1154942 |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------+
[root@host2 ~]# openstack server list
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| 24df940f-500b-44db-88e2-a6fd1fe915c0 | instance | BUILD | private=fda2:f16f:605e:0:f816:3eff:fefd:8796, 10.0.0.7 | rhel75 | vgpu |
| 515f0d21-6ab8-406e-9889-177718c79e61 | instance-2 | ERROR | | rhel75 | vgpu |
| 5a8691a8-a18c-4c71-8541-be00f224fd82 | instance-1 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fe1f:d7a, 10.0.0.11 | rhel75 | vgpu |
| 56aeda96-f193-49fc-914d-8b507674eb16 | instance0 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fef2:8e20, 10.0.0.12, 172.24.4.2 | rhel75 | vgpu |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
[root@host2 ~]# openstack server list
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| 24df940f-500b-44db-88e2-a6fd1fe915c0 | instance | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fefd:8796, 10.0.0.7 | rhel75 | vgpu |
| 515f0d21-6ab8-406e-9889-177718c79e61 | instance-2 | ERROR | | rhel75 | vgpu |
| 5a8691a8-a18c-4c71-8541-be00f224fd82 | instance-1 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fe1f:d7a, 10.0.0.11 | rhel75 | vgpu |
| 56aeda96-f193-49fc-914d-8b507674eb16 | instance0 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fef2:8e20, 10.0.0.12, 172.24.4.2 | rhel75 | vgpu |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
[root@host2 ~]# openstack server create --flavor vgpu --image rhel75 --key-name myself --max 1 instance
+-------------------------------------+-----------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | 69crZEFxBT9j |
| config_drive | |
| created | 2018-07-05T09:21:43Z |
| flavor | vgpu (vgpu1) |
| hostId | |
| id | 4a172549-91c2-46cc-8895-cd2fcbb19430 |
| image | rhel75 (e63a49a8-4568-4b57-9d12-1eb1ede28438) |
| key_name | myself |
| name | instance |
| progress | 0 |
| project_id | fdea2c781db74ae593c5e9501e9290cc |
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2018-07-05T09:21:43Z |
| user_id | 130a646fc362418f8b62ac11f1154942 |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------+
[root@host2 ~]# openstack server list
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| 4a172549-91c2-46cc-8895-cd2fcbb19430 | instance | BUILD | | rhel75 | vgpu |
| 24df940f-500b-44db-88e2-a6fd1fe915c0 | instance | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fefd:8796, 10.0.0.7 | rhel75 | vgpu |
| 515f0d21-6ab8-406e-9889-177718c79e61 | instance-2 | ERROR | | rhel75 | vgpu |
| 5a8691a8-a18c-4c71-8541-be00f224fd82 | instance-1 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fe1f:d7a, 10.0.0.11 | rhel75 | vgpu |
| 56aeda96-f193-49fc-914d-8b507674eb16 | instance0 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fef2:8e20, 10.0.0.12, 172.24.4.2 | rhel75 | vgpu |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
[root@host2 ~]# openstack server list
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
| 4a172549-91c2-46cc-8895-cd2fcbb19430 | instance | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fe7d:a6d8, 10.0.0.4 | rhel75 | vgpu |
| 24df940f-500b-44db-88e2-a6fd1fe915c0 | instance | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fefd:8796, 10.0.0.7 | rhel75 | vgpu |
| 515f0d21-6ab8-406e-9889-177718c79e61 | instance-2 | ERROR | | rhel75 | vgpu |
| 5a8691a8-a18c-4c71-8541-be00f224fd82 | instance-1 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fe1f:d7a, 10.0.0.11 | rhel75 | vgpu |
| 56aeda96-f193-49fc-914d-8b507674eb16 | instance0 | ACTIVE | private=fda2:f16f:605e:0:f816:3eff:fef2:8e20, 10.0.0.12, 172.24.4.2 | rhel75 | vgpu |
+--------------------------------------+------------+--------+---------------------------------------------------------------------+--------+--------+
- Nova error:
{u'message': u'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance de2a5078-6acd-4ffd-9895-d664adb42296.', u'code': 500, u'details': u' File "/opt/stack/nova/nova/conductor/manager.py", line 579, in build_instances\n raise exception.MaxRetriesExceeded(reason=msg)\n', u'created': u'2018-07-05T07:32:52Z'} |
- Libvirt error:
messages:Jul 5 03:32:51 host2 nova-compute: : libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/25f56195-9719-4380-a90b-084d64307e06 is in use by driver QEMU, domain instance-00000019
messages:Jul 5 03:32:51 host2 nova-compute: ERROR nova.virt.libvirt.driver [None req-e04582ed-de22-4bfa-9253-92e687328a4c service nova] [instance: de2a5078-6acd-4ffd-9895-d664adb42296] Failed to start libvirt guest: libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/25f56195-9719-4380-a90b-084d64307e06 is in use by driver QEMU, domain instance-00000019
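The libvirt message names both the contended mediated device and the domain that already holds it. A minimal Python sketch for pulling those fields out of such a log line; the function name `parse_mdev_conflict` and both regular expressions are illustrative assumptions derived only from the message text quoted in this report, and syslog-escaped colour codes of the form `#033[...m` are stripped first when present:

```python
import re

# syslog escapes the ESC byte as "#033", so ANSI colour codes appear as "#033[01;31m".
ANSI_RE = re.compile(r"#033\[[0-9;]*m")

# Pattern modelled on the libvirtError text in this report (an assumption,
# not a format guaranteed by libvirt).
MDEV_RE = re.compile(
    r"mediated device (?P<path>\S+/(?P<uuid>[0-9a-f-]{36})) "
    r"is in use by driver (?P<driver>\w+), domain (?P<domain>\S+)"
)


def parse_mdev_conflict(line):
    """Return a dict describing the mdev conflict, or None if the line has none."""
    clean = ANSI_RE.sub("", line)
    match = MDEV_RE.search(clean)
    return match.groupdict() if match else None


sample = (
    "libvirtError: Requested operation is not valid: mediated device "
    "/sys/bus/mdev/devices/25f56195-9719-4380-a90b-084d64307e06 "
    "is in use by driver QEMU, domain instance-00000019"
)
info = parse_mdev_conflict(sample)
print(info["uuid"], info["domain"])
```

Grouping the parsed results by `uuid` across a log file would show which mediated device was handed to more than one guest.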
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1780225/+subscriptions