
yahoo-eng-team team mailing list archive

[Bug 1915255] [NEW] [Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById


Public bug reported:

Description
===========

When deploying OpenStack Victoria on Ubuntu 20.04 (Focal) on
arm64/aarch64, nova-compute 22.0.1 fails to start with the following
error in nova-compute.log:

----------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 156, in get_ifname_by_pci_address
    dev_info = os.listdir(dev_path)
FileNotFoundError: [Errno 2] No such file or directory: '/sys/bus/pci/devices/0002:01:00.1/physfn/net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9823, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
  File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 880, in update_available_resource
    resources = self.driver.get_available_resource(nodename)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8473, in get_available_resource
    data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in _get_pci_passthrough_devices
    pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in <listcomp>
    pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7199, in _get_pcidev_info
    device.update(_get_device_type(cfgdev, address, dev, net_devs))
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7154, in _get_device_type
    parent_ifname = pci_utils.get_ifname_by_pci_address(
  File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 159, in get_ifname_by_pci_address
    raise exception.PciDeviceNotFoundById(id=pci_addr)
nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found
----------

This results in an empty `openstack hypervisor list`.

This does not happen with OpenStack Ussuri (nova-compute 21.1.0), and we
haven't seen it on other architectures (yet). The code that raises the
exception was added between Ussuri and Victoria [0], so the first
release containing it is 22.0.0.
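To tell whether a node's installed package falls on the affected side of that boundary, the upstream part of the Debian version string (as printed by `dpkg -l` below) can be compared against 22.0.0. This is a minimal sketch that assumes 22.0.0 is the first affected release, as stated above. The helper names are hypothetical, and the Ussuri-style version string in the usage lines is illustrative. The parsing is deliberately simplified and does not implement full Debian version ordering. `dpkg --compare-versions` is the authoritative tool for that.

```python
import re

# Assumption taken from the report: the PF net-directory lookup first
# shipped in nova 22.0.0 (Victoria); 21.x (Ussuri) does not have it.
FIRST_AFFECTED = (22, 0, 0)


def upstream_version(deb_version: str) -> tuple:
    """Extract the upstream part of a Debian version as a tuple of ints.

    Simplified: drops the epoch ("2:") and the Debian revision
    ("-0ubuntu1~cloud0") and keeps only the leading numeric components.
    """
    no_epoch = deb_version.split(":", 1)[-1]
    upstream = no_epoch.rsplit("-", 1)[0]
    parts = []
    for piece in upstream.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_victoria_pci_lookup(deb_version: str) -> bool:
    """Return True if the packaged nova version is >= 22.0.0."""
    return upstream_version(deb_version) >= FIRST_AFFECTED


print(has_victoria_pci_lookup("2:22.0.1-0ubuntu1~cloud0"))  # True
print(has_victoria_pci_lookup("2:21.1.0-0ubuntu1~cloud0"))  # False
```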

$ lspci | grep 0002:01:00.1
0002:01:00.1 Ethernet controller: Cavium, Inc. THUNDERX Network Interface Controller virtual function (rev 09)

Indeed, /sys/bus/pci/devices/0002:01:00.1/physfn/ doesn't contain a
`net` directory. I'm not sure whether that is a real problem with the
host, or whether nova-compute should simply catch the exception and
move on.
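If nova-compute were to "catch the exception and move on", the parent-interface lookup could treat a missing `physfn/net` directory as "this VF's PF has no netdev" rather than as a fatal error. The sketch below is a minimal version of that idea under stated assumptions. It is not nova's actual code. `find_pf_ifname` is a hypothetical helper, and its `sysfs_root` parameter exists only so the function can be tested against a fake tree. The only thing taken from the report is the `<root>/<pci_addr>/physfn/net` path in the traceback above.

```python
import os
from typing import Optional

SYSFS_PCI_DEVICES = "/sys/bus/pci/devices"


def find_pf_ifname(pci_addr: str,
                   sysfs_root: str = SYSFS_PCI_DEVICES) -> Optional[str]:
    """Return the netdev name of a VF's parent PF, or None if there is none.

    Hypothetical helper: reads <root>/<addr>/physfn/net like the traceback
    shows, but returns None instead of raising when the directory is
    missing (as for the ThunderX VF 0002:01:00.1 in this report).
    """
    net_dir = os.path.join(sysfs_root, pci_addr, "physfn", "net")
    try:
        names = sorted(os.listdir(net_dir))
    except FileNotFoundError:
        # No physfn link, or the PF exposes no net/ directory.
        return None
    # Pick the first name deterministically; an empty directory means no netdev.
    return names[0] if names else None
```

A caller could then log a warning and skip the parent-interface field for that one device instead of aborting the whole resource update.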

A similar past issue [1] suggests that this may be specific to the
Cavium ThunderX NIC.
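To check whether only the ThunderX VFs are affected on a given host, one can list every PCI device that has a `physfn` link (that is, an SR-IOV VF) but whose PF exposes no `net` directory. This is a minimal sketch that assumes the sysfs layout shown in the traceback above. `find_vfs_without_pf_netdev` and `read_attr` are hypothetical names. The device's `vendor` attribute is included so the affected devices can be matched against `lspci -nn` output.

```python
import os
from typing import List, Tuple

SYSFS_PCI_DEVICES = "/sys/bus/pci/devices"


def read_attr(path: str) -> str:
    """Read a small sysfs attribute, returning '' if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def find_vfs_without_pf_netdev(
        sysfs_root: str = SYSFS_PCI_DEVICES) -> List[Tuple[str, str]]:
    """List (pci_addr, vendor_id) for VFs whose parent PF has no net/ dir."""
    affected = []
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        physfn = os.path.join(dev, "physfn")
        if not os.path.exists(physfn):
            continue  # not an SR-IOV virtual function
        if not os.path.isdir(os.path.join(physfn, "net")):
            affected.append((addr, read_attr(os.path.join(dev, "vendor"))))
    return affected
```

Given the sysfs state described above, calling `find_vfs_without_pf_netdev()` on the affected node should include 0002:01:00.1 in its result.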

Related issue: [2]

Steps to reproduce
==================

Install and run nova >= 22.0.0 on an aarch64 machine, ideally one with
a Cavium ThunderX NIC. I use Juju [3] to deploy a complete OpenStack
Victoria setup to a lab:

$ git clone https://github.com/openstack-charmers/openstack-bundles
$ cd openstack-bundles/development/openstack-base-focal-victoria/
$ juju deploy ./bundle.yaml

Expected result
===============

`openstack hypervisor list` shows at least one hypervisor.
nova-compute.log doesn't contain nova.exception.PciDeviceNotFoundById.

Actual result
=============

`openstack hypervisor list` doesn't show any hypervisor.
nova-compute.log contains nova.exception.PciDeviceNotFoundById.
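To see which PCI addresses trigger the failure on a node, the addresses can be extracted from nova-compute.log. This is a minimal sketch that assumes the message format in the last line of the traceback above (`PciDeviceNotFoundById: PCI device <addr> not found`). `missing_pci_devices` is a hypothetical helper, and it accepts any iterable of lines, such as an open log file.

```python
import re
from typing import Iterable, Set

# Matches the final traceback line, e.g.
# "nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found"
PCI_NOT_FOUND = re.compile(
    r"PciDeviceNotFoundById: PCI device "
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) not found"
)


def missing_pci_devices(lines: Iterable[str]) -> Set[str]:
    """Return the distinct PCI addresses reported by PciDeviceNotFoundById."""
    found = set()
    for line in lines:
        match = PCI_NOT_FOUND.search(line)
        if match:
            found.add(match.group(1))
    return found
```

For the log in this report, the result would contain 0002:01:00.1.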

Environment
===========

$ dpkg -l | grep nova
ii  nova-api-metadata                    2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - metadata API frontend
ii  nova-common                          2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - common files
ii  nova-compute                         2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node base
ii  nova-compute-kvm                     2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node (KVM)
ii  nova-compute-libvirt                 2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node libvirt support
ii  python3-nova                         2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute Python 3 libraries
ii  python3-novaclient                   2:17.2.1-0ubuntu1~cloud0                             all          client library for OpenStack Compute API - 3.x

# cat /etc/nova/nova-compute.conf
[DEFAULT]
compute_driver=libvirt.LibvirtDriver
[libvirt]
virt_type=kvm

$ dpkg -l | grep libvirt
ii  libvirt-clients                      6.0.0-0ubuntu8.5                                     arm64        Programs for the libvirt library
ii  libvirt-daemon                       6.0.0-0ubuntu8.5                                     arm64        Virtualization daemon
ii  libvirt-daemon-driver-qemu           6.0.0-0ubuntu8.5                                     arm64        Virtualization daemon QEMU connection driver
ii  libvirt-daemon-driver-storage-rbd    6.0.0-0ubuntu8.5                                     arm64        Virtualization daemon RBD storage driver
ii  libvirt-daemon-system                6.0.0-0ubuntu8.5                                     arm64        Libvirt daemon configuration files
ii  libvirt-daemon-system-systemd        6.0.0-0ubuntu8.5                                     arm64        Libvirt daemon configuration files (systemd)
ii  libvirt0:arm64                       6.0.0-0ubuntu8.5                                     arm64        library for interfacing with different virtualization systems
ii  nova-compute-libvirt                 2:22.0.1-0ubuntu1~cloud0                             all          OpenStack Compute - compute node libvirt support
ii  python3-libvirt                      6.1.0-1                                              arm64        libvirt Python 3 bindings

These are probably not relevant, but for completeness:

* Ceph 15.2.7 for storage
* Neutron with OVN

Logs & Configs
==============

sosreport attached.

[0] https://opendev.org/openstack/nova/commit/efc27ff84c3
[1] https://bugs.launchpad.net/charm-nova-compute/+bug/1771662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1724999
[3] https://jaas.ai/openstack-base

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "sosreport-node-egede-2021-02-10-lzrvidh.tar.xz"
   https://bugs.launchpad.net/bugs/1915255/+attachment/5462220/+files/sosreport-node-egede-2021-02-10-lzrvidh.tar.xz

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1915255