yahoo-eng-team team mailing list archive

[Bug 1892361] Re: SRIOV instance gets type-PF interface, libvirt kvm fails

 

This bug was fixed in the package nova - 2:21.1.1-0ubuntu2

---------------
nova (2:21.1.1-0ubuntu2) focal; urgency=medium

  * d/p/lp1892361.patch: Update pci stat pools based on PCI device
changes (LP: #1892361).

 -- Chris MacNaughton <chris.macnaughton@xxxxxxxxxx>  Mon, 18 Jan 2021
15:25:16 +0000

** Changed in: nova (Ubuntu Focal)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1892361

Title:
  SRIOV instance gets type-PF interface, libvirt kvm fails

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  In Progress
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in OpenStack Compute (nova) stein series:
  In Progress
Status in OpenStack Compute (nova) train series:
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in OpenStack Compute (nova) victoria series:
  Fix Released
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Bionic:
  New
Status in nova source package in Focal:
  Fix Released
Status in nova source package in Groovy:
  Fix Released
Status in nova source package in Hirsute:
  Fix Released

Bug description:
  When spawning an SR-IOV enabled instance on a newly deployed host,
  nova attempts to spawn it with a type-PF PCI device. This fails with
  the stack trace below.

  After restarting the neutron-sriov-agent and nova-compute services on
  the compute node and spawning an SR-IOV instance again, a type-VF PCI
  device is selected, and instance spawning succeeds.

  Stack trace:
  2020-08-20 08:29:09.558 7624 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 6db8011e6ecd4fd0aaa53c8f89f08b1b __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:400
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b] [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Instance failed to spawn: libvirtError: unsupported configuration: Interface type hostdev is currently supported on SR-IOV Virtual Functions only
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Traceback (most recent call last):
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2274, in _build_resources
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     yield resources
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2054, in _build_and_run_instance
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     block_device_info=block_device_info)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3147, in spawn
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     destroy_disks_on_failure=True)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5651, in _create_domain_and_network
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     destroy_disks_on_failure)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     self.force_reraise()
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     six.reraise(self.type_, self.value, self.tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5620, in _create_domain_and_network
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     post_xml_callback=post_xml_callback)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5555, in _create_domain
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     guest.launch(pause=pause)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 144, in launch
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     self._encoded_xml, errors='ignore')
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     self.force_reraise()
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     six.reraise(self.type_, self.value, self.tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 139, in launch
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     return self._domain.createWithFlags(flags)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     result = proxy_call(self._autowrap, f, *args, **kwargs)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     rv = execute(f, *args, **kwargs)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     six.reraise(c, e, tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     rv = meth(*args, **kwargs)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1092, in createWithFlags
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] libvirtError: unsupported configuration: Interface type hostdev is currently supported on SR-IOV Virtual Functions only
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]
  2020-08-20 08:29:09.599 7624 INFO nova.compute.manager [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b] [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Terminating instance

  To reproduce, bring up an instance with an SR-IOV port on a freshly
  deployed compute:

  + openstack port create -f value -c id --network testinstance_net --vnic-type=direct --binding-profile type=dict --binding-profile physical_network=physnet2 testinstance_net-port
  + openstack server create --flavor ce6da933-adc3-4e5f-a688-63b037705729 --image a3580f59-a6c6-41f6-85fa-2fc7277492a1 --nic port-id=547cd89a-3f91-4646-84d9-c9559b497526 --availability-zone nova:foo-compute-host testinstance_vanilla_66016d81-bc32-4def-a7b3-a3a164ca5164

  Observe that a PF is selected for the SR-IOV NIC.

  From nova-compute.log:

      <interface type='hostdev' managed='yes'>
        <mac address='98:03:9b:61:22:e9'/>
        <source>
          <address type='pci' domain='0x0000' bus='0xd8' slot='0x00' function='0x1'/>
        </source>
        <vlan>
          <tag id='48'/>
        </vlan>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </interface>
  ...
  2020-08-20 08:29:09.056 7624 DEBUG nova.virt.libvirt.vif [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b]
  vif_type=hw_veb ...
  vif={"profile":
    {"pci_slot": "0000:d8:00.1", "physical_network": "physnet2", "pci_vendor_info": "15b3:1015"},
    "ovs_interfaceid": null, "preserve_on_delete": true, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [],
    "address": "192.168.0.5"}], "version": 4, "meta": {"dhcp_server": "192.168.0.2"}, "dns": [], "routes": [], "cidr": "192.168.0.0/29",
    "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.0.1"}}], "meta": {"injected": false, "tenant_id": "dd99e7950a5b46b5b924ccd1720b6257",
    "physical_network": "physnet2", "mtu": 9000},
    "id": "60b3001e-21c1-4947-8996-314449f614c060b3001e-21c1-4947-8996-314449f614c0", "label": "net_20Aug-1"}, "devname": "tapf3953098-98", "vnic_type": "direct", "qbh_params": null, "meta": {},
    "details": {"port_filter": false, "vlan": "48"}, "address": "98:03:9b:61:22:e9", "active": false, "type": "hw_veb", "id": "f3953098-98f7-4dd1-8b31-11f51a5a760f", "qbg_params": null}
  virt_type=kvm get_config /usr/lib/python2.7/dist-packages/nova/virt/libvirt/vif.py:572
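The `pci_slot` value in the vif profile above and the `<source>` address in the interface XML name the same PCI function in two notations. The following is a minimal sketch of that mapping, not nova's own code; the names `PciAddress`, `parse_pci_slot`, `libvirt_source_attrs` and `node_device_name` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import Dict, NamedTuple


class PciAddress(NamedTuple):
    """A PCI function address split into its four numeric fields."""
    domain: int
    bus: int
    slot: int
    function: int


# Matches the "dddd:bb:ss.f" notation, e.g. "0000:d8:00.1".
_PCI_RE = re.compile(
    r"^(?P<domain>[0-9a-fA-F]{4}):(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<slot>[0-9a-fA-F]{2})\.(?P<function>[0-7])$"
)


def parse_pci_slot(pci_slot: str) -> PciAddress:
    """Parse a pci_slot string such as the one in the vif profile."""
    match = _PCI_RE.match(pci_slot)
    if match is None:
        raise ValueError(f"not a PCI address: {pci_slot!r}")
    fields = ("domain", "bus", "slot", "function")
    return PciAddress(*(int(match.group(name), 16) for name in fields))


def libvirt_source_attrs(addr: PciAddress) -> Dict[str, str]:
    """Format the address the way it appears in the <source> element."""
    return {
        "domain": f"0x{addr.domain:04x}",
        "bus": f"0x{addr.bus:02x}",
        "slot": f"0x{addr.slot:02x}",
        "function": f"0x{addr.function:x}",
    }


def node_device_name(addr: PciAddress) -> str:
    """Format the address in the pci_dddd_bb_ss_f style."""
    return (
        f"pci_{addr.domain:04x}_{addr.bus:02x}"
        f"_{addr.slot:02x}_{addr.function:x}"
    )


if __name__ == "__main__":
    addr = parse_pci_slot("0000:d8:00.1")
    print(libvirt_source_attrs(addr))
    print(node_device_name(addr))
```

Applied to `0000:d8:00.1`, the attributes match the `<source>` address of the failing interface, which is how the log excerpt ties the hostdev interface to that PCI function.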

  Device is a PF:

  # lspci | grep d8:00.1
  d8:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
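Besides reading the `lspci` description, the function type can be cross-checked from sysfs on the compute node. The sketch below relies on the Linux convention that a VF directory under `/sys/bus/pci/devices` carries a `physfn` link and an SR-IOV capable PF carries `sriov_totalvfs`. The function name `classify_pci_function` is hypothetical, and the returned labels only mirror the `dev_type` values seen in this report; nova's own classification goes through libvirt and may differ.

```python
import os


def classify_pci_function(address: str,
                          sysfs_root: str = "/sys/bus/pci/devices") -> str:
    """Return a dev_type-style label for a PCI address using sysfs.

    A 'physfn' entry marks a virtual function; 'sriov_totalvfs' marks
    an SR-IOV capable physical function; anything else is reported as
    a plain PCI device.
    """
    dev_dir = os.path.join(sysfs_root, address)
    if not os.path.isdir(dev_dir):
        raise FileNotFoundError(dev_dir)
    # lexists() so that a dangling 'physfn' symlink still counts.
    if os.path.lexists(os.path.join(dev_dir, "physfn")):
        return "type-VF"
    if os.path.lexists(os.path.join(dev_dir, "sriov_totalvfs")):
        return "type-PF"
    return "type-PCI"


if __name__ == "__main__":
    import sys

    # Usage: python3 classify.py 0000:d8:00.1
    for pci_address in sys.argv[1:]:
        print(pci_address, classify_pci_function(pci_address))
```

The `sysfs_root` parameter exists only so the function can be exercised against a scratch directory instead of a real host.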

  Also, the nova pci_devices table has its dev_type correctly listed:

  mysql> select compute_nodes.host, pci_devices.created_at, compute_node_id, address, dev_type, status, pci_devices.dev_id from pci_devices join compute_nodes ON (compute_nodes.id = pci_devices.compute_node_id) where   compute_nodes.host = 'foo-compute-host' and pci_devices.dev_type = 'type-PF';
  +------------------+---------------------+-----------------+--------------+----------+-----------+------------------+
  | host             | created_at          | compute_node_id | address      | dev_type | status    | dev_id           |
  +------------------+---------------------+-----------------+--------------+----------+-----------+------------------+
  | foo-compute-host | 2020-08-12 17:10:19 |              95 | 0000:19:00.1 | type-PF  | available | pci_0000_19_00_1 |
  | foo-compute-host | 2020-08-12 17:10:19 |              95 | 0000:d8:00.1 | type-PF  | available | pci_0000_d8_00_1 |
  +------------------+---------------------+-----------------+--------------+----------+-----------+------------------+

  Restarting services:

  # systemctl status neutron-sriov-agent.service
  # systemctl restart neutron-sriov-agent.service

  Spawning an instance again, it gets allocated a type-VF port (and
  spawning succeeds):

      <interface type='hostdev' managed='yes'>
        <mac address='fa:16:3e:34:d2:99'/>
        <source>
          <address type='pci' domain='0x0000' bus='0xd8' slot='0x05' function='0x1'/>
        </source>
        <vlan>
          <tag id='4'/>
        </vlan>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </interface>

  # lspci | grep d8:05.1
  d8:05.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
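When comparing the failing and the working case, it can help to pull the host PCI address out of the interface XML mechanically and then feed it to `lspci`. This is a small sketch using only the Python standard library; the function name `hostdev_interface_addresses` is hypothetical, and the XML is assumed to have the shape of the `<interface type='hostdev'>` excerpts in this report.

```python
import xml.etree.ElementTree as ET
from typing import List


def hostdev_interface_addresses(xml_text: str) -> List[str]:
    """Return the host PCI addresses of all hostdev interfaces.

    Accepts either a full domain XML document or a single <interface>
    element. Only the <source>/<address> child is read; the guest-side
    <address> directly under <interface> is ignored.
    """
    root = ET.fromstring(xml_text)
    addresses = []
    # iter() also yields the root itself when it is an <interface>.
    for iface in root.iter("interface"):
        if iface.get("type") != "hostdev":
            continue
        src = iface.find("./source/address")
        if src is None or src.get("type") != "pci":
            continue
        # int(..., 16) accepts the '0x' prefix used by libvirt.
        domain = int(src.get("domain"), 16)
        bus = int(src.get("bus"), 16)
        slot = int(src.get("slot"), 16)
        function = int(src.get("function"), 16)
        addresses.append(f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}")
    return addresses


if __name__ == "__main__":
    import sys

    # Usage: virsh dumpxml <domain> | python3 hostdev_addrs.py
    for address in hostdev_interface_addresses(sys.stdin.read()):
        print(address)
```

For the working interface above, the extracted address is the one passed to `lspci`, which reports it as a Virtual Function.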

  After spawning an instance, the PF gets marked as "unavailable" in the
  nova db:

  +------------------+---------------------+---------------------+---------------+-----------------+--------------+----------+-------------+------------------+
  | host             | created_at          | updated_at          | instance_uuid | compute_node_id | address      | dev_type | status      | dev_id           |
  +------------------+---------------------+---------------------+---------------+-----------------+--------------+----------+-------------+------------------+
  | foo-compute-host | 2020-08-12 17:10:19 | 2020-08-20 11:45:07 | NULL          |              95 | 0000:19:00.1 | type-PF  | available   | pci_0000_19_00_1 |
  | foo-compute-host | 2020-08-12 17:10:19 | 2020-08-20 11:46:30 | NULL          |              95 | 0000:d8:00.1 | type-PF  | unavailable | pci_0000_d8_00_1 |
  +------------------+---------------------+---------------------+---------------+-----------------+--------------+----------+-------------+------------------+

  Software versions:

  # dpkg -l | grep nova-common
  ii  nova-common                            2:17.0.12-0ubuntu1                              all          OpenStack Compute - common files
  # dpkg -l | grep libvirt0
  ii  libvirt0:amd64                         4.0.0-1ubuntu8.17                               amd64        library for interfacing with different virtualization systems
  # lsb_release -r
  Release:        18.04

  
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

  [Impact]

  Spawning an SR-IOV instance fails on a newly deployed compute.
  Nova attempts to attach a PCI device of type type-PF to the instance
  instead of one of type type-VF.

  This happened in an OpenStack Queens deployment.

  [Test case]

  1. The issue can be reproduced by following the steps in comment #3:
     https://bugs.launchpad.net/nova/+bug/1892361/comments/3

  2. Install the package containing the fix.

  3. Confirm that the bug has been fixed:
     repeat the steps in comment #3 and check that the instance with the SR-IOV port is created successfully.

  [Where problems could occur]

  Upstream CI ran all the functional test cases that trigger this scenario.
  Installing the new package will restart the nova-compute service.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1892361/+subscriptions

