[Bug 1893904] Re: Placement is not updated if a VGPU is re-created on a new GPU upon host reboot
This was a known issue that should have been fixed by
https://review.opendev.org/#/c/715489/, which was merged during the
Ussuri timeframe.
To be clear, since mdevs disappear when you reboot, Nova now tries
to find the GPU it had already provided by looking at the guest XML.
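For illustration only, here is a minimal sketch of that kind of lookup outside of Nova. It is not Nova's actual code: the helper names are made up, and it assumes the libvirt Python bindings plus the usual sysfs layout where /sys/bus/mdev/devices/<uuid> resolves to a path under the parent PCI device.

    # Minimal sketch, NOT Nova's implementation: list the mdev UUIDs referenced
    # by each guest's XML and map each one back to its parent GPU via sysfs.
    import os
    import xml.etree.ElementTree as ET

    import libvirt  # assumes the libvirt Python bindings are installed


    def mdev_uuids_from_guest(dom):
        """Return the mdev UUIDs referenced by a libvirt domain's XML."""
        root = ET.fromstring(dom.XMLDesc())
        return [
            addr.get('uuid')
            for hostdev in root.findall("./devices/hostdev[@type='mdev']")
            for addr in hostdev.findall('./source/address')
        ]


    def parent_gpu_of_mdev(uuid):
        """Resolve an mdev UUID to its parent PCI device (sysfs walk)."""
        # /sys/bus/mdev/devices/<uuid> is a symlink that resolves under the
        # parent PCI device, e.g. /sys/devices/pci0000:00/.../0000:3b:00.0/<uuid>
        real = os.path.realpath('/sys/bus/mdev/devices/%s' % uuid)
        return os.path.basename(os.path.dirname(real))  # e.g. '0000:3b:00.0'


    conn = libvirt.open('qemu:///system')
    for dom in conn.listAllDomains():
        for uuid in mdev_uuids_from_guest(dom):
            print(dom.name(), uuid, parent_gpu_of_mdev(uuid))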
Closing this bug as the master branch no longer has the problem, but
please reopen it if you can reproduce it on master.
** Changed in: nova
Status: New => Won't Fix
** Changed in: nova
Importance: Undecided => Low
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1893904
Title:
Placement is not updated if a VGPU is re-created on a new GPU upon
host reboot
Status in OpenStack Compute (nova):
Won't Fix
Bug description:
First of all, I'm not really sure which project to "blame" for this
bug, but here's the problem:
When you reboot a compute-node with Nvidia GRID and guests running
with a VGPU attached, the guests will often have their VGPU re-created
on a different GPU than before the reboot. This is not updated in
placement, causing the placement API to provide false information
about which resource provider is actually a valid allocation
candidate for a new VGPU.
Steps to reproduce:
1. Create a new instance with a VGPU attached and take note of which GPU the VGPU is created on (with nvidia-smi vgpu)
2. Reboot the compute-node
3. Start the instance, and observe that its VGPU now lives on a different GPU
4. Check "openstack allocation candidate list --resource VGPU=1" and correlate the resource provider id with "openstack resource provider list" to see that placement will now list the allocated GPU as free, while the initial GPU (from before the reboot) is still marked as used (see the sketch after this list).
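A rough sketch of the correlation in step 4, driving the openstack CLI from Python. The JSON column names ('uuid', 'name', 'resource provider') are assumptions about the osc-placement output, so adjust them to whatever your client actually prints.

    # Sketch of the check in step 4: list allocation candidates and resource
    # providers via the openstack CLI and print them side by side. Assumes the
    # osc-placement plugin is installed and both commands accept '-f json'.
    import json
    import subprocess


    def osc(*args):
        """Run an openstack CLI command and parse its JSON output."""
        out = subprocess.check_output(('openstack',) + args + ('-f', 'json'))
        return json.loads(out)


    providers = {rp['uuid']: rp['name']  # column names are assumptions
                 for rp in osc('resource', 'provider', 'list')}
    candidates = osc('allocation', 'candidate', 'list', '--resource', 'VGPU=1')

    # Each candidate row names a resource provider that placement still thinks
    # can satisfy a VGPU request; compare this against what nvidia-smi vgpu says.
    for row in candidates:
        rp = row.get('resource provider')  # column name is an assumption
        print(rp, providers.get(rp, '<unknown>'))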
This will obviously only be an issue on compute-nodes with multiple
physical GPUs.
Examples:
https://paste.ubuntu.com/p/PZ6qgKtnRb/
This will eventually cause scheduling of new VGPU instances to fail,
because they will try to use a device that is in reality already in
use (but marked as available in placement).
Expected results:
Either the GRID driver and libvirt should ensure that an instance keeps the same GPU for its VGPU through reboots (effectively making this... not a nova bug)
OR
nova-compute should notify placement of the change and update the
allocations (roughly as sketched below)
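For the second option, here is a hedged sketch of what such an update amounts to against the placement API. This is not how nova-compute would do it internally; the endpoint, token, UUIDs and microversion below are placeholders, not values from this report.

    # Sketch: re-point an instance's VGPU allocation at the resource provider
    # of the GPU it actually landed on after the reboot. Endpoint, token,
    # UUIDs and microversion are placeholders.
    import requests

    PLACEMENT = 'http://placement.example:8778'     # placeholder endpoint
    HEADERS = {
        'X-Auth-Token': '<admin-token>',            # placeholder token
        'OpenStack-API-Version': 'placement 1.28',  # microversion with consumer_generation
    }

    instance_uuid = '<instance-uuid>'               # the consumer of the VGPU
    old_rp = '<uuid-of-the-GPU-placement-still-marks-as-used>'
    new_rp = '<uuid-of-the-GPU-the-VGPU-now-lives-on>'

    # Read the current allocations so any non-VGPU resources are kept intact.
    body = requests.get('%s/allocations/%s' % (PLACEMENT, instance_uuid),
                        headers=HEADERS).json()

    # PUT only accepts "resources" per provider, so strip the extra keys
    # returned by GET, then move the VGPU entry to the provider actually in use.
    allocations = {rp: {'resources': alloc['resources']}
                   for rp, alloc in body['allocations'].items()}
    allocations[new_rp] = allocations.pop(old_rp)

    requests.put(
        '%s/allocations/%s' % (PLACEMENT, instance_uuid),
        headers=HEADERS,
        json={
            'allocations': allocations,
            'consumer_generation': body['consumer_generation'],
            'project_id': body['project_id'],
            'user_id': body['user_id'],
        },
    )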
Versions:
This was first observed in Stein, but the issue is also present in Train.
# rpm -qa | grep nova
python2-nova-20.3.0-1.el7.noarch
python2-novaclient-15.1.1-1.el7.noarch
openstack-nova-compute-20.3.0-1.el7.noarch
openstack-nova-common-20.3.0-1.el7.noarch
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1893904/+subscriptions