yahoo-eng-team team mailing list archive

[Bug 1943863] Re: DPDK instances are failing to start: Failed to bind socket to /run/libvirt-vhost-user/vhu3ba44fdc-7c: No such file or directory

 

https://github.com/openstack-charmers/charm-layer-ovn/pull/52

** Also affects: neutron
   Importance: Undecided
       Status: New

** No longer affects: neutron

** No longer affects: neutron (Ubuntu)

** Also affects: charm-layer-ovn
   Importance: Undecided
       Status: New

** Changed in: charm-layer-ovn
       Status: New => Confirmed

** Changed in: charm-layer-ovn
   Importance: Undecided => High

** Changed in: charm-layer-ovn
     Assignee: (unassigned) => Liam Young (gnuoy)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1943863

Title:
  DPDK instances are failing to start: Failed to bind socket to
  /run/libvirt-vhost-user/vhu3ba44fdc-7c: No such file or directory

Status in charm-layer-ovn:
  Confirmed
Status in OpenStack nova-compute charm:
  Invalid

Bug description:
  == Env
  focal/ussuri + ovn, latest stable charms
  juju status: https://paste.ubuntu.com/p/2725tV47ym/
  Hardware: Huawei CH121 V5 with MZ532, 4*25GE Mezzanine Card, PCIE 3.0 X16 NICs + manually installed PMD for DPDK enablement (librte-pmd-hinic20.0 package)
   
  == Problem description

  A DPDK instance cannot be launched after a fresh deployment
  (focal/ussuri + OVN, latest stable charms); it fails with the error below:

  $ os server show dpdk-test-instance -f yaml
  OS-DCF:diskConfig: MANUAL
  OS-EXT-AZ:availability_zone: ''
  OS-EXT-SRV-ATTR:host: null
  OS-EXT-SRV-ATTR:hypervisor_hostname: null
  OS-EXT-SRV-ATTR:instance_name: instance-00000218
  OS-EXT-STS:power_state: NOSTATE
  OS-EXT-STS:task_state: null
  OS-EXT-STS:vm_state: error
  OS-SRV-USG:launched_at: null
  OS-SRV-USG:terminated_at: null
  accessIPv4: ''
  accessIPv6: ''
  addresses: ''
  config_drive: 'True'
  created: '2021-09-15T18:51:00Z'
  fault:
    code: 500
    created: '2021-09-15T18:52:01Z'
    details: "Traceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/conductor/manager.py\"\
      , line 651, in build_instances\n    scheduler_utils.populate_retry(\n  File \"\
      /usr/lib/python3/dist-packages/nova/scheduler/utils.py\", line 919, in populate_retry\n\
      \    raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded:\
      \ Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance\
      \ 1bb2d1b7-e2e9-4d76-a346-a9b06ff22c73. Last exception: internal error: process\
      \ exited while connecting to monitor: 2021-09-15T18:51:53.485265Z qemu-system-x86_64:\
      \ -chardev socket,id=charnet0,path=/run/libvirt-vhost-user/vhu3ba44fdc-7c,server:\
      \ Failed to bind socket to /run/libvirt-vhost-user/vhu3ba44fdc-7c: No such file\
      \ or directory\n"
    message: 'Exceeded maximum number of retries. Exceeded max scheduling attempts 3
      for instance 1bb2d1b7-e2e9-4d76-a346-a9b06ff22c73. Last exception: internal error:
      process exited while connecting to monitor: 2021-09-15T18:51:53.485265Z qemu-system-x86_64:
      -chardev '
  flavor: m1.medium.project.dpdk (4f452aa3-2b2c-4f2e-8465-5e3c2d8ec3f1)
  hostId: ''
  id: 1bb2d1b7-e2e9-4d76-a346-a9b06ff22c73
  image: auto-sync/ubuntu-bionic-18.04-amd64-server-20210907-disk1.img (3851450e-e73d-489b-a356-33650690ed7a)
  key_name: ubuntu-keypair
  name: dpdk-test-instance
  project_id: cdade870811447a89e2f0199373a0d95
  properties: ''
  status: ERROR
  updated: '2021-09-15T18:52:01Z'
  user_id: 13a0e7862c6641eeaaebbde1ae096f9e
  volumes_attached: ''

  For the record, "generic" instances (e.g. non-DPDK/non-SRIOV) are
  scheduled and started without any issues.

  == Steps to reproduce

  openstack network create --external --provider-network-type vlan --provider-segment xxx --provider-physical-network dpdkfabric ext_net_dpdk
  openstack subnet create --allocation-pool start=<redacted>,end=<redacted> --network ext_net_dpdk --subnet-range <redacted>/23 --gateway <redacted> --no-dhcp ext_net_dpdk_subnet

  openstack aggregate create --zone nova dpdk
  openstack aggregate set --property dpdk=true dpdk

  openstack aggregate add host dpdk <fqdn>

  openstack aggregate show dpdk --max-width=80

  openstack flavor set --property aggregate_instance_extra_specs:dpdk=true --property hw:mem_page_size=large m1.medium.dpdk

  openstack server create --config-drive true --network ext_net_dpdk --key-name ubuntu-keypair --image focal --flavor m1.medium.dpdk dpdk-test-instance

  == Analysis
  [before redeployment] nova-compute log : https://pastebin.canonical.com/p/FgPYNb3bPj/
  [fresh deployment] juju crashdump: https://drive.google.com/file/d/1W_w3CAUq4ggp4alDnpCk08mSaCL6Uaxk/view?usp=sharing

  <on hypervisor>

  # ovs-vsctl get open_vswitch . other_config
  {dpdk-extra="--pci-whitelist 0000:3e:00.0 --pci-whitelist 0000:40:00.0", dpdk-init="true", dpdk-lcore-mask="0x1000001", dpdk-socket-mem="4096,4096"}
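
  As a reading aid for the dpdk-lcore-mask value above: the mask is a
  hexadecimal bitmask in which bit N selects CPU core N. The sketch below
  decodes such a mask into a core list. The function name mask_to_cores is
  hypothetical and not part of OVS or DPDK; it only illustrates the
  arithmetic.

```shell
#!/bin/sh
# Hypothetical helper (not an OVS/DPDK tool): list the core indices
# whose bits are set in a hexadecimal core mask such as 0x1000001.
mask_to_cores() {
    mask=$(( $1 ))      # shell arithmetic accepts 0x-prefixed hex constants
    core=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            # Append the core index, comma-separated after the first entry.
            out="${out:+$out,}$core"
        fi
        mask=$(( mask >> 1 ))
        core=$(( core + 1 ))
    done
    echo "$out"
}

mask_to_cores 0x1000001   # prints 0,24
```

  Under that reading, the configured mask selects cores 0 and 24.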

  # cat /etc/tmpfiles.d/nova-ovs-vhost-user.conf
  # Create libvirt writeable directory for vhost-user sockets
  d /run/libvirt-vhost-user 0770 libvirt-qemu kvm - -

  In fact, none of the compute hosts have that file:
  https://paste.ubuntu.com/p/XJRFypbMQf/ (however, the error from this
  issue doesn't appear on non-DPDK hosts).

  After running the command below, the missing /run/... file appeared
  and the VM could be scheduled and started. However, although it
  started, it was not reachable over the network.

  # systemd-tmpfiles --create
  # stat /run/libvirt-vhost-user
    File: /run/libvirt-vhost-user
    Size: 40              Blocks: 0          IO Block: 4096   directory
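
  The manual check and workaround above could be wrapped in a small
  helper, sketched below. The function name check_vhost_dir is
  hypothetical; the default path and the tmpfiles.d file name are the
  ones quoted in this report. This is only a sketch of the workaround,
  not the fix proposed for the charm.

```shell
#!/bin/sh
# Hypothetical helper: report whether the vhost-user socket directory
# exists. The default path comes from the tmpfiles.d entry in this report.
check_vhost_dir() {
    dir="${1:-/run/libvirt-vhost-user}"
    if [ -d "$dir" ]; then
        echo "present: $dir"
        return 0
    fi
    echo "missing: $dir"
    return 1
}

# On an affected hypervisor, as root, the workaround could then be:
#   check_vhost_dir || systemd-tmpfiles --create /etc/tmpfiles.d/nova-ovs-vhost-user.conf
```

  Passing the config file to systemd-tmpfiles limits the run to that one
  entry instead of processing every tmpfiles.d configuration on the host.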
