yahoo-eng-team team mailing list archive

[Bug 1892361] Re: SRIOV instance gets type-PF interface, libvirt kvm fails

 

Reviewed:  https://review.opendev.org/749175
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8695de6da56db42b83b9d9d4c330148766644be
Submitter: Zuul
Branch:    master

commit b8695de6da56db42b83b9d9d4c330148766644be
Author: Hemanth Nakkina <hemanth.nakkina@xxxxxxxxxxxxx>
Date:   Tue Sep 1 09:36:51 2020 +0530

    Update pci stat pools based on PCI device changes
    
    At start up of nova-compute service, the PCI stat pools are
    populated based on information in pci_devices table in Nova
    database. The pools are updated only when new device is added
    or removed but not on any device changes like device type.
    
    If an existing device is configured as SRIOV and nova-compute
    is restarted, the pci_devices table gets updated but the device
    is still listed under the old pool in pci_tracker.stats.pool
    (in-memory object).
    
    This patch looks for device type updates in existing devices
    and updates the pools accordingly.
    
    Change-Id: Id4ebb06e634a612c8be4be6c678d8265e0b99730
    Closes-Bug: #1892361
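
The behaviour described in the commit message can be illustrated with a
minimal sketch. This is not nova's code: PoolTracker, sync, and the pool
layout are hypothetical names chosen for illustration, and the pools are
keyed only by dev_type for simplicity. The dev_type values (type-PCI,
type-PF, type-VF) are the ones nova uses. The point is the extra branch
that moves a device between pools when its type changes, rather than
reacting only to devices being added or removed.

```python
from collections import defaultdict


class PoolTracker:
    """Toy stand-in for an in-memory PCI stats object (hypothetical)."""

    def __init__(self):
        self.pools = defaultdict(set)  # dev_type -> set of PCI addresses
        self.known = {}                # PCI address -> last seen dev_type

    def _add(self, address, dev_type):
        self.pools[dev_type].add(address)
        self.known[address] = dev_type

    def _remove(self, address):
        old_type = self.known.pop(address)
        self.pools[old_type].discard(address)
        if not self.pools[old_type]:
            del self.pools[old_type]  # drop pools that became empty

    def sync(self, devices):
        """Reconcile the pools with a fresh {address: dev_type} report."""
        # Devices that disappeared from the report are removed.
        for address in set(self.known) - set(devices):
            self._remove(address)
        for address, dev_type in devices.items():
            old_type = self.known.get(address)
            if old_type is None:
                # New device: add it to the matching pool.
                self._add(address, dev_type)
            elif old_type != dev_type:
                # Existing device whose type changed: move it.
                # Without this branch the device stays in its old pool.
                self._remove(address)
                self._add(address, dev_type)


tracker = PoolTracker()
tracker.sync({"0000:d8:00.1": "type-PCI"})
# SR-IOV is enabled later: the same address is now a PF and a VF appears.
tracker.sync({"0000:d8:00.1": "type-PF", "0000:d8:05.1": "type-VF"})
```

After the second sync call the address 0000:d8:00.1 is found only in the
type-PF pool, so a request for a type-VF device can no longer be matched
against it through a stale pool entry.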


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1892361

Title:
  SRIOV instance gets type-PF interface, libvirt kvm fails

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged

Bug description:
  When spawning an SR-IOV enabled instance on a newly deployed host,
  nova attempts to spawn it with a type-PF PCI device. This fails with
  the stack trace below.

  After restarting neutron-sriov-agent and nova-compute services on the
  compute node and spawning an SR-IOV instance again, a type-VF pci
  device is selected, and instance spawning succeeds.

  Stack trace:
  2020-08-20 08:29:09.558 7624 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 6db8011e6ecd4fd0aaa53c8f89f08b1b __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:400
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b] [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Instance failed to spawn: libvirtError: unsupported configuration: Interface type hostdev is currently supported on SR-IOV Virtual Functions only
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Traceback (most recent call last):
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2274, in _build_resources
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     yield resources
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2054, in _build_and_run_instance
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     block_device_info=block_device_info)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3147, in spawn
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     destroy_disks_on_failure=True)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5651, in _create_domain_and_network
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     destroy_disks_on_failure)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     self.force_reraise()
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     six.reraise(self.type_, self.value, self.tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5620, in _create_domain_and_network
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     post_xml_callback=post_xml_callback)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5555, in _create_domain
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     guest.launch(pause=pause)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 144, in launch
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     self._encoded_xml, errors='ignore')
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     self.force_reraise()
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     six.reraise(self.type_, self.value, self.tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 139, in launch
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     return self._domain.createWithFlags(flags)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     result = proxy_call(self._autowrap, f, *args, **kwargs)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     rv = execute(f, *args, **kwargs)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     six.reraise(c, e, tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     rv = meth(*args, **kwargs)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1092, in createWithFlags
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] libvirtError: unsupported configuration: Interface type hostdev is currently supported on SR-IOV Virtual Functions only
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] 
  2020-08-20 08:29:09.599 7624 INFO nova.compute.manager [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b] [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Terminating instance

  
  To reproduce, bring up an instance with an SR-IOV port on a freshly deployed compute host:

  + openstack port create -f value -c id --network testinstance_net --vnic-type=direct --binding-profile type=dict --binding-profile physical_network=physnet2 testinstance_net-port  
  + openstack server create --flavor ce6da933-adc3-4e5f-a688-63b037705729 --image a3580f59-a6c6-41f6-85fa-2fc7277492a1 --nic port-id=547cd89a-3f91-4646-84d9-c9559b497526 --availability-zone nova:foo-compute-host testinstance_vanilla_66016d81-bc32-4def-a7b3-a3a164ca5164 

  Observe that a PF is selected for the SR-IOV NIC.

  From nova-compute.log:

      <interface type='hostdev' managed='yes'>
        <mac address='98:03:9b:61:22:e9'/>
        <source>
          <address type='pci' domain='0x0000' bus='0xd8' slot='0x00' function='0x1'/>
        </source>
        <vlan>
          <tag id='48'/>
        </vlan>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </interface>
  ...
  2020-08-20 08:29:09.056 7624 DEBUG nova.virt.libvirt.vif [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b] 
  vif_type=hw_veb ...
  vif={"profile": 
    {"pci_slot": "0000:d8:00.1", "physical_network": "physnet2", "pci_vendor_info": "15b3:1015"}, 
    "ovs_interfaceid": null, "preserve_on_delete": true, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], 
    "address": "192.168.0.5"}], "version": 4, "meta": {"dhcp_server": "192.168.0.2"}, "dns": [], "routes": [], "cidr": "192.168.0.0/29", 
    "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.0.1"}}], "meta": {"injected": false, "tenant_id": "dd99e7950a5b46b5b924ccd1720b6257", 
    "physical_network": "physnet2", "mtu": 9000}, 
    "id": "60b3001e-21c1-4947-8996-314449f614c060b3001e-21c1-4947-8996-314449f614c0", "label": "net_20Aug-1"}, "devname": "tapf3953098-98", "vnic_type": "direct", "qbh_params": null, "meta": {}, 
    "details": {"port_filter": false, "vlan": "48"}, "address": "98:03:9b:61:22:e9", "active": false, "type": "hw_veb", "id": "f3953098-98f7-4dd1-8b31-11f51a5a760f", "qbg_params": null} 
  virt_type=kvm get_config /usr/lib/python2.7/dist-packages/nova/virt/libvirt/vif.py:572
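
The pci_slot value in the vif profile and the <source> address in the
interface XML above describe the same device in two notations. A small
sketch of the mapping follows; split_pci_address is a hypothetical
helper written for this note, not a nova or libvirt function, and it
assumes the usual domain:bus:slot.function form.

```python
import re

# domain:bus:slot.function, e.g. "0000:d8:00.1"
_PCI_ADDRESS = re.compile(
    r"^([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$"
)


def split_pci_address(address):
    """Split a PCI address into the attributes used by the XML address."""
    match = _PCI_ADDRESS.match(address.lower())
    if match is None:
        raise ValueError("not a PCI address: %r" % (address,))
    domain, bus, slot, function = match.groups()
    return {
        "domain": "0x" + domain,
        "bus": "0x" + bus,
        "slot": "0x" + slot,
        "function": "0x" + function,
    }


parts = split_pci_address("0000:d8:00.1")
# parts == {"domain": "0x0000", "bus": "0xd8", "slot": "0x00", "function": "0x1"}
```

The result for 0000:d8:00.1 matches the bus, slot, and function
attributes of the <source> address in the failing interface definition,
which confirms that the device handed to libvirt is the one named in
the vif profile.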

  The device is a PF:

  # lspci | grep d8:00.1
  d8:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
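
The lspci line identifies the device but does not by itself say whether
it is a PF or a VF. On Linux, sysfs can be used as a cross-check: a VF
has a physfn link pointing at its parent PF, and an SR-IOV capable PF
exposes a sriov_totalvfs attribute. The sketch below relies on those two
sysfs entries; pci_function_kind and its return strings are hypothetical
names for this note, and the classification is a simplification of what
nova and libvirt actually report.

```python
import os


def pci_function_kind(address, sysfs_root="/sys/bus/pci/devices"):
    """Classify a PCI function as "VF", "PF", or "non-SR-IOV" via sysfs."""
    device_dir = os.path.join(sysfs_root, address)
    if not os.path.isdir(device_dir):
        raise FileNotFoundError(device_dir)
    # A virtual function links back to its physical function.
    if os.path.exists(os.path.join(device_dir, "physfn")):
        return "VF"
    # An SR-IOV capable physical function advertises its VF limit.
    if os.path.exists(os.path.join(device_dir, "sriov_totalvfs")):
        return "PF"
    return "non-SR-IOV"
```

On the affected host, calling pci_function_kind("0000:d8:00.1") would be
expected to classify the device as a PF, given the pci_devices rows
shown in this report, but that has not been run here.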

  The nova pci_devices table also lists its dev_type correctly:

  mysql> select compute_nodes.host, pci_devices.created_at, compute_node_id, address, dev_type, status, pci_devices.dev_id from pci_devices join compute_nodes ON (compute_nodes.id = pci_devices.compute_node_id) where   compute_nodes.host = 'foo-compute-host' and pci_devices.dev_type = 'type-PF';
  +------------------+---------------------+-----------------+--------------+----------+-----------+------------------+
  | host             | created_at          | compute_node_id | address      | dev_type | status    | dev_id           |
  +------------------+---------------------+-----------------+--------------+----------+-----------+------------------+
  | foo-compute-host | 2020-08-12 17:10:19 |              95 | 0000:19:00.1 | type-PF  | available | pci_0000_19_00_1 |
  | foo-compute-host | 2020-08-12 17:10:19 |              95 | 0000:d8:00.1 | type-PF  | available | pci_0000_d8_00_1 |
  +------------------+---------------------+-----------------+--------------+----------+-----------+------------------+

  Restarting services:

  # systemctl status neutron-sriov-agent.service 
  # systemctl restart neutron-sriov-agent.service 

  When an instance is spawned again, it is allocated a type-VF port (and
  spawning succeeds):

      <interface type='hostdev' managed='yes'>
        <mac address='fa:16:3e:34:d2:99'/>
        <source>
          <address type='pci' domain='0x0000' bus='0xd8' slot='0x05' function='0x1'/>
        </source>
        <vlan>
          <tag id='4'/>
        </vlan>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </interface>

  # lspci | grep d8:05.1
  d8:05.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]

  
  After spawning an instance, the PF gets marked as "unavailable" in the nova db:

  +------------------+---------------------+---------------------+---------------+-----------------+--------------+----------+-------------+------------------+
  | host             | created_at          | updated_at          | instance_uuid | compute_node_id | address      | dev_type | status      | dev_id           |
  +------------------+---------------------+---------------------+---------------+-----------------+--------------+----------+-------------+------------------+
  | foo-compute-host | 2020-08-12 17:10:19 | 2020-08-20 11:45:07 | NULL          |              95 | 0000:19:00.1 | type-PF  | available   | pci_0000_19_00_1 |
  | foo-compute-host | 2020-08-12 17:10:19 | 2020-08-20 11:46:30 | NULL          |              95 | 0000:d8:00.1 | type-PF  | unavailable | pci_0000_d8_00_1 |
  +------------------+---------------------+---------------------+---------------+-----------------+--------------+----------+-------------+------------------+


  Software versions:

  # dpkg -l | grep nova-common
  ii  nova-common                            2:17.0.12-0ubuntu1                              all          OpenStack Compute - common files
  # dpkg -l | grep libvirt0
  ii  libvirt0:amd64                         4.0.0-1ubuntu8.17                               amd64        library for interfacing with different virtualization systems
  # lsb_release -r
  Release:        18.04

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1892361/+subscriptions

