
yahoo-eng-team team mailing list archive

[Bug 1893904] [NEW] Placement is not updated if a VGPU is re-created on new mdev upon host reboot

 

Public bug reported:

First of all, I'm not really sure which project to "blame" for this bug,
but here's the problem:

When you reboot a compute-node with Nvidia GRID and guests running with
a VGPU attached, the guests will often have their VGPU re-created on a
different mdev than the one used before the reboot. Placement is not
updated to reflect this, so the placement API reports false information
about which resource provider is actually a valid allocation candidate
for a new VGPU.

Steps to reproduce:
1. Create a new instance with a VGPU attached, and take note of which mdev the VGPU is created on (with nvidia-smi vgpu)
2. Reboot the compute-node
3. Start the instance, and observe that its VGPU now lives on a different mdev
4. Check "openstack allocation candidate list --resource VGPU=1" and correlate the resource provider IDs with "openstack resource provider list" to see that placement now lists the allocated mdev as free, while the initial mdev (from before the reboot) is still marked as used.
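For reference, the reproduction above can be sketched as a shell session. Hostnames, UUIDs, and tool availability are deployment-specific; the commands are guarded with `command -v` so the sketch is a harmless no-op on a machine without the GRID driver or the openstack CLI:

```shell
# 1. Before the reboot: note which mdev hosts the instance's VGPU.
#    (Requires the NVIDIA GRID driver on the compute node.)
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi vgpu    # lists each vGPU and the mdev/GPU it runs on
fi

# 2./3. Reboot the compute node, start the instance again, and re-run
#       "nvidia-smi vgpu" -- the VGPU often comes back on a different mdev.

# 4. Compare what placement believes with reality (needs the osc-placement
#    CLI plugin and cloud credentials loaded in the environment):
if command -v openstack >/dev/null 2>&1; then
    openstack allocation candidate list --resource VGPU=1
    openstack resource provider list
fi
# Correlating the two listings shows the mdev actually in use reported as
# free, while the pre-reboot mdev is still marked as used.
```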

Examples:
https://paste.ubuntu.com/p/PZ6qgKtnRb/

This will eventually cause scheduling of new VGPU instances to fail,
because the scheduler will pick a device that is in reality already in
use (but marked as available in placement).

Expected results:
Either the GRID driver and libvirt should ensure that an instance keeps the same mdev for its VGPU across reboots (effectively making this not a nova bug)

OR

nova-compute should notify placement of the change and update the
allocations accordingly.
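Until one of those fixes lands, the stale allocation can be corrected by hand with the osc-placement CLI. A hedged sketch follows: the consumer UUID, provider UUIDs, and the VCPU/MEMORY_MB amounts are placeholders to be taken from your own deployment (the amounts must match the instance's flavor). Note that `openstack resource provider allocation set` replaces the consumer's *entire* allocation, so the instance's existing allocations against the compute-node provider have to be restated alongside the corrected VGPU allocation:

```shell
# Placeholder identifiers -- substitute real values from your cloud.
INSTANCE_UUID="<instance-uuid>"
COMPUTE_RP_UUID="<resource-provider-uuid-of-the-compute-node>"
NEW_RP_UUID="<resource-provider-uuid-of-the-mdev-now-in-use>"

# Re-point the instance's VGPU allocation at the resource provider that
# matches the mdev the guest actually came back on. VCPU=2,MEMORY_MB=4096
# are example flavor values only. Guarded so the sketch is a no-op where
# the openstack CLI is absent.
if command -v openstack >/dev/null 2>&1; then
    openstack resource provider allocation set \
        --allocation rp="${COMPUTE_RP_UUID}",VCPU=2,MEMORY_MB=4096 \
        --allocation rp="${NEW_RP_UUID}",VGPU=1 \
        "${INSTANCE_UUID}"
fi
```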

Versions:
This was first observed in Stein, but the issue is also present in Train.
# rpm -qa | grep nova
python2-nova-20.3.0-1.el7.noarch
python2-novaclient-15.1.1-1.el7.noarch
openstack-nova-compute-20.3.0-1.el7.noarch
openstack-nova-common-20.3.0-1.el7.noarch

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1893904

Title:
  Placement is not updated if a VGPU is re-created on new mdev upon host
  reboot

Status in OpenStack Compute (nova):
  New



