yahoo-eng-team team mailing list archive
Message #85135
[Bug 1915255] Re: [Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById
This is a real issue: the Cavium ThunderX hardware violates an assumption we make, namely that a PF has netdevs if its VFs do.
We just need to re-add the try/except that was removed:
https://review.opendev.org/c/openstack/nova/+/739131/12/nova/virt/libvirt/driver.py#b6957
It was originally removed because we only look at the subset of VFs that are NICs,
but since the Cavium ThunderX does not assign a PF to all VFs
per https://bugs.launchpad.net/charm-nova-compute/+bug/1771662
we need to catch the exception in this case, as we did before.
This means that minimum bandwidth based QoS cannot be implemented on
this hardware, as we rely on the PF netdev name to correlate the
bandwidth between nova and neutron, but other functionality should work.
The only way to support minimum bandwidth QoS on this hardware would be to
alter the NIC driver or enhance nova/neutron to support using the PF
PCI address instead of the parent netdev name.
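
The try/except being re-added could look roughly like the following. This is a standalone sketch, not the nova patch: `PciDeviceNotFoundById` and `get_ifname_by_pci_address` here are local stand-ins that only mimic the names in the traceback, and `get_parent_ifname` and the `sysfs_root` parameter are hypothetical, added so the snippet runs outside nova.

```python
import os


class PciDeviceNotFoundById(Exception):
    """Local stand-in for nova.exception.PciDeviceNotFoundById."""


def get_ifname_by_pci_address(pci_addr, pf_interface=False,
                              sysfs_root="/sys/bus/pci/devices"):
    """Return the netdev name of a PCI device, or of its PF if requested."""
    parts = [sysfs_root, pci_addr]
    if pf_interface:
        parts.append("physfn")
    parts.append("net")
    try:
        # The first entry under .../net is taken as the interface name.
        return os.listdir(os.path.join(*parts))[0]
    except (OSError, IndexError):
        raise PciDeviceNotFoundById(pci_addr)


def get_parent_ifname(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Hypothetical caller: PF netdev name for a VF, or None if there is none."""
    try:
        return get_ifname_by_pci_address(
            pci_addr, pf_interface=True, sysfs_root=sysfs_root)
    except PciDeviceNotFoundById:
        # Hardware such as the Cavium ThunderX exposes no netdev on the PF.
        # Report no parent interface instead of aborting the whole scan.
        return None
```

With this shape, a VF whose PF has no `net` directory yields `None` for the parent interface name rather than an exception, which is also why anything keyed on that name (such as the minimum bandwidth correlation) would have nothing to work with on this hardware.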
** Changed in: nova
Importance: Undecided => Medium
** Changed in: nova
Status: New => Triaged
** Also affects: nova/victoria
Importance: Undecided
Status: New
** Changed in: nova/victoria
Status: New => Triaged
** Changed in: nova/victoria
Importance: Undecided => Medium
** Changed in: nova
Assignee: (unassigned) => sean mooney (sean-k-mooney)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1915255
Title:
[Victoria] nova-compute won't start on aarch64 - raises
PciDeviceNotFoundById
Status in OpenStack Compute (nova):
Triaged
Status in OpenStack Compute (nova) victoria series:
Triaged
Bug description:
Description
===========
When deploying OpenStack Victoria on Ubuntu 20.04 (Focal) on
arm64/aarch64, nova-compute 22.0.1 fails to start with (nova-
compute.log):
----------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 156, in get_ifname_by_pci_address
    dev_info = os.listdir(dev_path)
FileNotFoundError: [Errno 2] No such file or directory: '/sys/bus/pci/devices/0002:01:00.1/physfn/net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9823, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
  File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 880, in update_available_resource
    resources = self.driver.get_available_resource(nodename)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8473, in get_available_resource
    data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in _get_pci_passthrough_devices
    pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in <listcomp>
    pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7199, in _get_pcidev_info
    device.update(_get_device_type(cfgdev, address, dev, net_devs))
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7154, in _get_device_type
    parent_ifname = pci_utils.get_ifname_by_pci_address(
  File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 159, in get_ifname_by_pci_address
    raise exception.PciDeviceNotFoundById(id=pci_addr)
nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found
----------
This results in an empty `openstack hypervisor list`.
This does not happen with OpenStack Ussuri (nova-compute 21.1.0). We
also haven't seen this on other architectures (yet?). This code
actually appeared between Ussuri and Victoria, [0] i.e. the first
version having it is 22.0.0.
$ lspci | grep 0002:01:00.1
0002:01:00.1 Ethernet controller: Cavium, Inc. THUNDERX Network Interface Controller virtual function (rev 09)
Indeed /sys/bus/pci/devices/0002:01:00.1/physfn/ doesn't contain `net`
but I'm not sure if that's really a problem or if nova-compute should
just catch the exception and move on?
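
To see which devices on a host are in this state, something like the following could be used. It is only a sketch: the function name `vfs_without_pf_netdev` and its `sysfs_root` parameter are made up for illustration, and it assumes the usual sysfs layout where a VF has a `physfn` link to its PF.

```python
import os


def vfs_without_pf_netdev(sysfs_root="/sys/bus/pci/devices"):
    """Return sorted PCI addresses of VFs whose parent PF exposes no netdev."""
    missing = []
    for addr in sorted(os.listdir(sysfs_root)):
        physfn = os.path.join(sysfs_root, addr, "physfn")
        if not os.path.exists(physfn):
            continue  # no physfn link, so this is not a virtual function
        net_dir = os.path.join(physfn, "net")
        # Missing or empty physfn/net means the PF has no network interface.
        if not os.path.isdir(net_dir) or not os.listdir(net_dir):
            missing.append(addr)
    return missing
```

Calling `vfs_without_pf_netdev()` on the compute node should list 0002:01:00.1 if the lookup above is what fails, since its `physfn` directory has no `net` entry.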
A similar issue in the past [1] shows that this might be an issue
specific to the Cavium Thunder X NIC.
Related issue: [2]
Steps to reproduce
==================
Install and run nova >= 22.0.0 on an aarch64 machine (with a Cavium
Thunder X NIC if possible). I personally use Juju [3] for deploying an
entire OpenStack Victoria setup to a lab:
$ git clone https://github.com/openstack-charmers/openstack-bundles
$ cd openstack-bundles/development/openstack-base-focal-victoria/
$ juju deploy ./bundle.yaml
Expected result
===============
`openstack hypervisor list` shows at least one hypervisor.
nova-compute.log doesn't contain nova.exception.PciDeviceNotFoundById
Actual result
=============
`openstack hypervisor list` doesn't show any hypervisor.
nova-compute.log contains nova.exception.PciDeviceNotFoundById
Environment
===========
$ dpkg -l | grep nova
ii  nova-api-metadata     2:22.0.1-0ubuntu1~cloud0  all  OpenStack Compute - metadata API frontend
ii  nova-common           2:22.0.1-0ubuntu1~cloud0  all  OpenStack Compute - common files
ii  nova-compute          2:22.0.1-0ubuntu1~cloud0  all  OpenStack Compute - compute node base
ii  nova-compute-kvm      2:22.0.1-0ubuntu1~cloud0  all  OpenStack Compute - compute node (KVM)
ii  nova-compute-libvirt  2:22.0.1-0ubuntu1~cloud0  all  OpenStack Compute - compute node libvirt support
ii  python3-nova          2:22.0.1-0ubuntu1~cloud0  all  OpenStack Compute Python 3 libraries
ii  python3-novaclient    2:17.2.1-0ubuntu1~cloud0  all  client library for OpenStack Compute API - 3.x
# cat /etc/nova/nova-compute.conf
[DEFAULT]
compute_driver=libvirt.LibvirtDriver
[libvirt]
virt_type=kvm
$ dpkg -l | grep libvirt
ii  libvirt-clients                    6.0.0-0ubuntu8.5          arm64  Programs for the libvirt library
ii  libvirt-daemon                     6.0.0-0ubuntu8.5          arm64  Virtualization daemon
ii  libvirt-daemon-driver-qemu         6.0.0-0ubuntu8.5          arm64  Virtualization daemon QEMU connection driver
ii  libvirt-daemon-driver-storage-rbd  6.0.0-0ubuntu8.5          arm64  Virtualization daemon RBD storage driver
ii  libvirt-daemon-system              6.0.0-0ubuntu8.5          arm64  Libvirt daemon configuration files
ii  libvirt-daemon-system-systemd      6.0.0-0ubuntu8.5          arm64  Libvirt daemon configuration files (systemd)
ii  libvirt0:arm64                     6.0.0-0ubuntu8.5          arm64  library for interfacing with different virtualization systems
ii  nova-compute-libvirt               2:22.0.1-0ubuntu1~cloud0  all    OpenStack Compute - compute node libvirt support
ii  python3-libvirt                    6.1.0-1                   arm64  libvirt Python 3 bindings
This shouldn't be relevant but:
* Ceph 15.2.7 for storage
* Neutron with OVN
Logs & Configs
==============
sosreport attached.
[0] https://opendev.org/openstack/nova/commit/efc27ff84c3
[1] https://bugs.launchpad.net/charm-nova-compute/+bug/1771662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1724999
[3] https://jaas.ai/openstack-base
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1915255/+subscriptions