yahoo-eng-team team mailing list archive

[Bug 1827453] [NEW] Nova scheduler attempts to re-assign currently in-use SR-IOV VF to new VM

 

Public bug reported:

We are running a small cluster with 16 compute nodes and 3 controller
nodes on OpenStack Queens using SR-IOV VFs.  From time to time, the Nova
scheduler appears to lose track of some of the PCI devices (VFs) that
are actively mapped into servers.  We don't know exactly when this
occurs and we cannot trigger it on demand, but over time it has occurred
on a number of the compute nodes.  Restarting the affected compute node
resolves the issue.

The problem manifests as errors like the following:

/var/log/nova/nova-conductor.log:2019-05-03 01:35:27.309 13073 ERROR
nova.scheduler.utils [req-8418eb3a-4118-4505-97e3-fffbaae7aae6
2469493ff8b546ff9a6f4e339cc50ac2 33bb32d9463340bca0bb72a8c36579a9 -
default default] [instance: b2b4dbf2-d381-4416-95c9-b410aa6d8377] Error
from last host: node05 (node {REDACTED}): [u'Traceback (most recent call
last):\n', u'  File "/usr/lib/python2.7/dist-
packages/nova/compute/manager.py", line 1828, in
_do_build_and_run_instance\n    filter_properties, request_spec)\n', u'
File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line
2108, in _build_and_run_instance\n    instance_uuid=instance.uuid,
reason=six.text_type(e))\n', u'RescheduledException: Build of instance
b2b4dbf2-d381-4416-95c9-b410aa6d8377 was re-scheduled: Requested
operation is not valid: PCI device 0000:04:01.3 is in use by driver
QEMU, domain instance-00001466\n']
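
To find every VF affected by this, the conflicting PCI address and the
libvirt domain that already holds it can be pulled out of the conductor
log.  The following is only a minimal sketch: the regular expression is
written against the single message quoted above, and the function name
find_pci_conflicts is hypothetical rather than part of Nova.

```python
import re

# Matches the libvirt message quoted in the log excerpt above.
# Assumption: other occurrences use the same wording.
IN_USE_RE = re.compile(
    r"PCI device (?P<address>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"is in use by driver (?P<driver>\w+), domain (?P<domain>[\w-]+)"
)


def find_pci_conflicts(log_text):
    """Return (address, driver, domain) for each 'in use' error found."""
    # The log is hard-wrapped in this report, so collapse whitespace first.
    flattened = " ".join(log_text.split())
    return [
        (m.group("address"), m.group("driver"), m.group("domain"))
        for m in IN_USE_RE.finditer(flattened)
    ]


# Sample taken from the tail of the error above.
sample = r"""re-scheduled: Requested
operation is not valid: PCI device 0000:04:01.3 is in use by driver
QEMU, domain instance-00001466\n']"""

conflicts = find_pci_conflicts(sample)
print(conflicts)
```

The domain name returned here (instance-00001466) identifies the running
guest that still owns the VF, which can then be compared against what
Nova believes is allocated on that host.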

The compute nodes in question are configured with the following PCI
whitelist:

[pci]
passthrough_whitelist = [{"vendor_id": "15b3", "product_id": "1004"}]

Note that, unlike in similar bug reports, there haven't been any changes
to the whitelist that would likely cause this to occur.  It just seems
to develop over time.
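
As a sanity check that the whitelist itself still covers the VF named in
the error, the entry can be matched against a device's IDs.  This is a
simplified sketch, not Nova's own matching code (which supports more
fields than vendor and product ID).  The device dict below is
hypothetical: the report does not state the IDs of 0000:04:01.3, so they
are assumed to be the whitelisted ones.

```python
import json

# The passthrough_whitelist value from the [pci] section above.
WHITELIST = '[{"vendor_id": "15b3", "product_id": "1004"}]'


def matches_whitelist(device, whitelist_json=WHITELIST):
    """True if the device agrees with every key of at least one entry."""
    entries = json.loads(whitelist_json)
    return any(
        all(device.get(key) == value for key, value in entry.items())
        for entry in entries
    )


# Hypothetical device record; the IDs are assumed, not taken from the report.
vf = {"vendor_id": "15b3", "product_id": "1004", "address": "0000:04:01.3"}
other = {"vendor_id": "8086", "product_id": "10ed", "address": "0000:05:10.0"}

print(matches_whitelist(vf), matches_whitelist(other))
```

If the VF matches, as assumed here, the whitelist is not what hides the
device from Nova, which is consistent with the problem developing over
time rather than following a configuration change.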

===== Versions =====

Compute nodes:

ii  nova-common                           2:17.0.6-0ubuntu1                      all          OpenStack Compute - common files
ii  nova-compute                          2:17.0.6-0ubuntu1                      all          OpenStack Compute - compute node base
ii  nova-compute-kvm                      2:17.0.6-0ubuntu1                      all          OpenStack Compute - compute node (KVM)
ii  nova-compute-libvirt                  2:17.0.6-0ubuntu1                      all          OpenStack Compute - compute node libvirt support

Controller nodes:

ii  nova-api                              2:17.0.9-0ubuntu1                           all          OpenStack Compute - API frontend
ii  nova-common                           2:17.0.9-0ubuntu1                           all          OpenStack Compute - common files
ii  nova-compute                          2:17.0.9-0ubuntu1                           all          OpenStack Compute - compute node base
ii  nova-compute-kvm                      2:17.0.9-0ubuntu1                           all          OpenStack Compute - compute node (KVM)
ii  nova-compute-libvirt                  2:17.0.9-0ubuntu1                           all          OpenStack Compute - compute node libvirt support
ii  nova-conductor                        2:17.0.9-0ubuntu1                           all          OpenStack Compute - conductor service
ii  nova-consoleauth                      2:17.0.9-0ubuntu1                           all          OpenStack Compute - Console Authenticator
ii  nova-novncproxy                       2:17.0.9-0ubuntu1                           all          OpenStack Compute - NoVNC proxy
ii  nova-placement-api                    2:17.0.9-0ubuntu1                           all          OpenStack Compute - placement API frontend
ii  nova-scheduler                        2:17.0.9-0ubuntu1                           all          OpenStack Compute - virtual machine scheduler
ii  nova-serialproxy                      2:17.0.9-0ubuntu1                           all          OpenStack Compute - serial proxy
ii  nova-xvpvncproxy                      2:17.0.9-0ubuntu1                           all          OpenStack Compute - XVP VNC proxy

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1827453

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1827453/+subscriptions

