yahoo-eng-team team mailing list archive

[Bug 1915255] Re: [Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/777679
Committed: https://opendev.org/openstack/nova/commit/a569a51fedd058fdae2eb0066e087c37688987f8
Submitter: "Zuul (22348)"
Branch:    master

commit a569a51fedd058fdae2eb0066e087c37688987f8
Author: Sean Mooney <work@xxxxxxxxxxxxxxx>
Date:   Fri May 21 14:45:45 2021 +0100

    fix sr-iov support on Cavium ThunderX hosts.
    
    This change is a partial revert of
    Ibf8dca4bd57b3bddb39955b53cc03564506f5754
    to reintroduce a try-except which is required for
    some non-standard hardware.
    
    On the Cavium ThunderX platform, it's possible to have
    virtual functions which are netdevs but which are not associated
    with a PF. This causes the PF name lookup to fail.
    Prior to Ibf8dca4bd57b3bddb39955b53cc03564506f5754
    when the lookup failed it was caught and we skipped
    populating the parent PF interface name.
    
    This change restores that behavior.
    
    Closes-Bug: #1915255
    Change-Id: Ia10ccdd9fbed3870d0592e3cbbff17f292651dd2
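
The restored behavior can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual nova patch: `PciDeviceNotFound`, `lookup_ifname` and `parent_ifname_or_none` are hypothetical stand-ins modeled on the names that appear in the traceback in the bug description, and the `sysfs_root` parameter is an assumption added only so the sketch can be exercised against a fake directory tree.

```python
import os


class PciDeviceNotFound(Exception):
    """Hypothetical stand-in for nova.exception.PciDeviceNotFoundById."""

    def __init__(self, pci_addr):
        super().__init__("PCI device %s not found" % pci_addr)


def lookup_ifname(pci_addr, pf_interface=False,
                  sysfs_root="/sys/bus/pci/devices"):
    """Return the netdev name of a PCI device, or of its parent PF.

    Raises PciDeviceNotFound when the expected 'net' directory is
    missing or empty, which is the failure seen on ThunderX hosts.
    """
    parts = [sysfs_root, pci_addr]
    if pf_interface:
        # For a VF, 'physfn' points at the parent physical function.
        parts.append("physfn")
    parts.append("net")
    try:
        return os.listdir(os.path.join(*parts))[0]
    except (OSError, IndexError):
        raise PciDeviceNotFound(pci_addr)


def parent_ifname_or_none(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Look up the parent PF interface name, tolerating a failed lookup.

    This mirrors the idea of the restored try-except: if the PF has no
    netdev, skip populating the parent interface name instead of
    letting the exception abort the whole device inventory.
    """
    try:
        return lookup_ifname(pci_addr, pf_interface=True,
                             sysfs_root=sysfs_root)
    except PciDeviceNotFound:
        return None
```

With a wrapper like this, a VF such as 0002:01:00.1 would simply be reported without a parent interface name rather than preventing nova-compute from reporting its resources.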


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1915255

Title:
  [Victoria] nova-compute won't start on aarch64 - raises
  PciDeviceNotFoundById

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) victoria series:
  Triaged

Bug description:
  Description
  ===========

  When deploying OpenStack Victoria on Ubuntu 20.04 (Focal) on
  arm64/aarch64, nova-compute 22.0.1 fails to start with the following
  traceback (nova-compute.log):

  ----------
  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 156, in get_ifname_by_pci_address
      dev_info = os.listdir(dev_path)
  FileNotFoundError: [Errno 2] No such file or directory: '/sys/bus/pci/devices/0002:01:00.1/physfn/net'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9823, in _update_available_resource_for_node
      self.rt.update_available_resource(context, nodename,
    File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 880, in update_available_resource
      resources = self.driver.get_available_resource(nodename)
    File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8473, in get_available_resource
      data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
    File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in _get_pci_passthrough_devices
      pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
    File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in <listcomp>
      pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
    File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7199, in _get_pcidev_info
      device.update(_get_device_type(cfgdev, address, dev, net_devs))
    File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7154, in _get_device_type
      parent_ifname = pci_utils.get_ifname_by_pci_address(
    File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 159, in get_ifname_by_pci_address
      raise exception.PciDeviceNotFoundById(id=pci_addr)
  nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found
  ----------

  This results in an empty `openstack hypervisor list`.

  This does not happen with OpenStack Ussuri (nova-compute 21.1.0), and
  we haven't seen it on other architectures (yet?). The code that raises
  this exception was introduced between Ussuri and Victoria [0], so the
  first release containing it is 22.0.0.

  $ lspci | grep 0002:01:00.1
  0002:01:00.1 Ethernet controller: Cavium, Inc. THUNDERX Network Interface Controller virtual function (rev 09)

  Indeed /sys/bus/pci/devices/0002:01:00.1/physfn/ doesn't contain `net`
  but I'm not sure if that's really a problem or if nova-compute should
  just catch the exception and move on?
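
To check which devices on a host are affected, the sysfs layout described above can be inspected directly. The following is a rough diagnostic sketch, not part of nova: the function name `vfs_without_pf_netdev` is hypothetical, and it assumes the usual Linux sysfs convention that only virtual functions have a `physfn` entry.

```python
import os


def vfs_without_pf_netdev(sysfs_root="/sys/bus/pci/devices"):
    """List PCI addresses of VFs whose parent PF exposes no 'net' directory."""
    missing = []
    for addr in sorted(os.listdir(sysfs_root)):
        physfn = os.path.join(sysfs_root, addr, "physfn")
        if not os.path.isdir(physfn):
            # No 'physfn' entry: this device is not a virtual function.
            continue
        if not os.path.isdir(os.path.join(physfn, "net")):
            # The PF has no netdev, so a PF interface name lookup
            # for this VF would fail.
            missing.append(addr)
    return missing
```

On the host from this report, calling `vfs_without_pf_netdev()` would be expected to include 0002:01:00.1 in its result, since `/sys/bus/pci/devices/0002:01:00.1/physfn/net` does not exist.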

  A similar issue in the past [1] suggests that this might be specific
  to the Cavium ThunderX NIC.

  Related issue: [2]

  Steps to reproduce
  ==================

  Install and run nova >= 22.0.0 on an aarch64 machine (with a Cavium
  ThunderX NIC if possible). I personally use Juju [3] for deploying an
  entire OpenStack Victoria setup to a lab:

  $ git clone https://github.com/openstack-charmers/openstack-bundles
  $ cd openstack-bundles/development/openstack-base-focal-victoria/
  $ juju deploy ./bundle.yaml

  Expected result
  ===============

  `openstack hypervisor list` shows at least one hypervisor.
  nova-compute.log doesn't contain nova.exception.PciDeviceNotFoundById

  Actual result
  =============

  `openstack hypervisor list` doesn't show any hypervisor.
  nova-compute.log contains nova.exception.PciDeviceNotFoundById

  Environment
  ===========

  $ dpkg -l | grep nova
  ii  nova-api-metadata                    2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - metadata API frontend
  ii  nova-common                          2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - common files
  ii  nova-compute                         2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node base
  ii  nova-compute-kvm                     2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node (KVM)
  ii  nova-compute-libvirt                 2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node libvirt support
  ii  python3-nova                         2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute Python 3 libraries
  ii  python3-novaclient                   2:17.2.1-0ubuntu1~cloud0                             all          client library for OpenStack Compute API - 3.x

  # cat /etc/nova/nova-compute.conf
  [DEFAULT]
  compute_driver=libvirt.LibvirtDriver
  [libvirt]
  virt_type=kvm

  $ dpkg -l | grep libvirt
  ii  libvirt-clients                      6.0.0-0ubuntu8.5                                     arm64        Programs for the libvirt library
  ii  libvirt-daemon                       6.0.0-0ubuntu8.5                                     arm64        Virtualization daemon
  ii  libvirt-daemon-driver-qemu           6.0.0-0ubuntu8.5                                     arm64        Virtualization daemon QEMU connection driver
  ii  libvirt-daemon-driver-storage-rbd    6.0.0-0ubuntu8.5                                     arm64        Virtualization daemon RBD storage driver
  ii  libvirt-daemon-system                6.0.0-0ubuntu8.5                                     arm64        Libvirt daemon configuration files
  ii  libvirt-daemon-system-systemd        6.0.0-0ubuntu8.5                                     arm64        Libvirt daemon configuration files (systemd)
  ii  libvirt0:arm64                       6.0.0-0ubuntu8.5                                     arm64        library for interfacing with different virtualization systems
  ii  nova-compute-libvirt                 2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node libvirt support
  ii  python3-libvirt                      6.1.0-1                                              arm64        libvirt Python 3 bindings

  This shouldn't be relevant but:

  * Ceph 15.2.7 for storage
  * Neutron with OVN

  Logs & Configs
  ==============

  sosreport attached.

  [0] https://opendev.org/openstack/nova/commit/efc27ff84c3
  [1] https://bugs.launchpad.net/charm-nova-compute/+bug/1771662
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=1724999
  [3] https://jaas.ai/openstack-base

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1915255/+subscriptions

