kernel-packages team mailing list archive

[Bug 1570195] Re: Net tools cause kernel soft lockup after DPDK touched VirtIO-pci devices

 

Before going into discussions about how it "should" be, I added more debug
code and gathered some good case vs bad case data.

First of all it is "ok" to have no more buffers.
I had a printk in a code path that only triggers when !more_used is true.
And I've seen plenty of hits for all kinds of idx values.
On adding virtio traffic it triggers a few times as well.
After all, that is what the loop is for: to wait until there is a buffer that it can get.
So things aren't broken if this ever triggers - but of course they are if it never changes.

IIRC: last_used_idx == vring.used->idx just means nothing happened since
our last interaction (to be confirmed).
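
To make the index comparison concrete, here is a minimal user-space sketch of the check behind the "No more buffers" message. It is not the kernel code: struct sketch_vq and the sketch_* functions are hypothetical names, and device_used_idx merely stands in for vq->vring.used->idx. It only assumes what the debug output shows, namely that the message appears when both indices are equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver-side virtqueue state. */
struct sketch_vq {
    uint16_t last_used_idx;   /* last used-ring index the driver consumed */
    uint16_t device_used_idx; /* stands in for vq->vring.used->idx (host-written) */
};

/* True when the host has published entries the driver has not consumed yet. */
bool sketch_more_used(const struct sketch_vq *vq)
{
    return vq->last_used_idx != vq->device_used_idx;
}

/* Number of pending entries; 16-bit unsigned arithmetic handles wrap-around. */
uint16_t sketch_pending(const struct sketch_vq *vq)
{
    return (uint16_t)(vq->device_used_idx - vq->last_used_idx);
}

/* Consume one entry if available; false is the "No more buffers" case. */
bool sketch_get_buf(struct sketch_vq *vq)
{
    assert(vq != NULL);
    if (!sketch_more_used(vq))
        return false;
    vq->last_used_idx++; /* wraps from 65535 to 0, like the ring index */
    return true;
}
```

In this model a single "false" result is harmless. The bad case is the one where device_used_idx never advances, so every call keeps returning false.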

Good case:
Some !more_used hits might occur, but they are unrelated and do not repeat infinitely
[  393.542550] __virtqueue_get_buf: No more buffers in vq ffff8801b74b3000 - vq->last_used_idx 303 == vq->vring.used->idx 303
[  394.097117] __virtqueue_get_buf: No more buffers in vq ffff8801b74b3000 - vq->last_used_idx 304 == vq->vring.used->idx 304
[  394.097413] __virtqueue_get_buf: No more buffers in vq ffff8801b74b4000 - vq->last_used_idx 125 == vq->vring.used->idx 125
[...]
[  394.449672] __virtqueue_get_buf: Entry checks passed - vq ffff8800bbaef000 from _vq ffff8800bbaef000
[  394.452734] __virtqueue_get_buf: Exit checks passed - ffff8801b74b5840 vq->data[i]
[  394.455087] __virtqueue_get_buf: Returning ret ffff8801b74b5840
Done

Bad case (after DPDK ran):
Now both debug printks trigger.
I get a LOT of
[  552.018862] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
With sequences like this in between:
[  554.157376] __virtqueue_get_buf: No more buffers in vq ffff8800bbaef000 - vq->last_used_idx 2 == vq->vring.used->idx 2
[  554.158916] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
[  554.160135] __virtqueue_get_buf: No more buffers in vq ffff8800bbaef000 - vq->last_used_idx 2 == vq->vring.used->idx 2
[  554.161583] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
[  554.162776] __virtqueue_get_buf: No more buffers in vq ffff8800bbaef000 - vq->last_used_idx 2 == vq->vring.used->idx 2
[  554.164189] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
[...] (infinite loop)


Current assumption: DPDK disables something in the host part of the virtio device, so that the host no longer responds "correctly".
By unbinding/rebinding the driver we can reinitialize that, but without doing so we run into this hang.
Remember: we only initialize DPDK with testpmd, no load whatsoever is driven by it.

We likely need two fixes:
1. find out what DPDK does "to" the device and avoid it
2. the kernel should give up after some number of retries and return a failure (not good, but much better than hanging)
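
As a rough illustration of fix 2, a bounded wait could look like the user-space sketch below. Everything in it is hypothetical (sketch_poll_fn, sketch_wait_bounded, the countdown callback and the choice of -ETIMEDOUT); it is not a proposed kernel patch, only the shape of "retry N times, then return a failure" instead of spinning forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the device has answered. */
typedef bool (*sketch_poll_fn)(void *ctx);

/*
 * Poll at most max_tries times.
 * Returns 0 on success and -ETIMEDOUT if the device never answered.
 */
int sketch_wait_bounded(sketch_poll_fn poll, void *ctx, unsigned int max_tries)
{
    assert(poll != NULL);
    for (unsigned int i = 0; i < max_tries; i++) {
        if (poll(ctx))
            return 0;
        /* a real driver would relax the CPU or sleep between tries */
    }
    return -ETIMEDOUT;
}

/* Example callback: ctx points to a countdown; ready once it reaches zero. */
bool sketch_countdown_poll(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

A caller would then see an error from the failing ethtool/ip operation rather than a soft lockup. What a sensible retry limit or timeout would be is left open here.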

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1570195

Title:
  Net tools cause kernel soft lockup after DPDK touched VirtIO-pci
  devices

Status in dpdk package in Ubuntu:
  Confirmed
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Guys,

   I'm facing an issue here with both "ethtool" and "ip", while trying
  to manage PCI VirtIO devices that are black-listed by DPDK.

   You'll need an Ubuntu Xenial KVM guest, with 4 VirtIO vNIC cards, to
  run those tests

   PCI device example from inside a Xenial guest:

  ---
  # lspci | grep Ethernet
  00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:05.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:06.0 Ethernet controller: Red Hat, Inc Virtio network device
  ---

  Where "ens3" is the first / default interface, attached to Libvirt's
  "default" network. The "ens4" is reserved for "ethtool / ip" tests
  (attached to another Libvirt's network without IPs or DHCP), "ens5"
  will be "dpdk0" and "ens6" "dpdk1"...

  ---
   *** How it works?

   1- For example, try to enable multi-queue on DPDK's devices, boot
  your Xenial guest, and run:

   ethtool -L ens5 combined 4
   ethtool -L ens6 combined 4

   2- Install openvswitch-switch-dpdk, configure DPDK and OVS, and fire
  it up.

   https://help.ubuntu.com/16.04/serverguide/DPDK.html

   service openvswitch-switch stop
   service dpdk stop

   OVS DPDK Options (/etc/default/openvswitch-switch):

  --
  DPDK_OPTS='--dpdk -c 0x1 -n 4 --socket-mem 1024 --pci-blacklist 0000:00:03.0,0000:00:04.0'
  --

   service dpdk start
   service openvswitch-switch start

   - Enable multi-queue on OVS+DPDK inside of the VM:

   ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4
   ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xff00

   * Multi-queue apparently works! ovs-vswitchd consumes more than 100%
  of CPU, meaning that multi-queue is there...

   *** Where it fails?

   1- Reboot the VM and try to run ethtool again (or go straight to 2
  below):

   ethtool -L ens5 combined 4

   2- Try to fire up ens4:

   ip link set dev ens4 up

  
   # FAIL! Both commands hang, consuming 100% of the guest's CPU...

   So, it looks like a Linux fault, because it is "allowing" the DPDK
  VirtIO App (a user land App), to interfere with kernel devices in a
  strange way...

  Best,
  Thiago

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.4.0-18-generic 4.4.0-18.34
  ProcVersionSignature: Ubuntu 4.4.0-18.34-generic 4.4.6
  Uname: Linux 4.4.0-18-generic x86_64
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Apr 14 00:35 seq
   crw-rw---- 1 root audio 116, 33 Apr 14 00:35 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.1-0ubuntu1
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: [Errno 2] No such file or directory: 'fuser'
  CRDA: N/A
  Date: Thu Apr 14 01:27:27 2016
  HibernationDevice: RESUME=UUID=833e999c-e066-433c-b8a2-4324bb8d56de
  InstallationDate: Installed on 2016-04-07 (7 days ago)
  InstallationMedia: Ubuntu-Server 16.04 LTS "Xenial Xerus" - Beta amd64 (20160406)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  Lsusb:
   Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
   Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
   Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
   Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
  MachineType: QEMU Standard PC (i440FX + PIIX, 1996)
  PciMultimedia:
   
  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-18-generic root=UUID=9911604e-353b-491f-a0a9-804724350592 ro
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-18-generic N/A
   linux-backports-modules-4.4.0-18-generic  N/A
   linux-firmware                            N/A
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  SourcePackage: linux
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 04/01/2014
  dmi.bios.vendor: SeaBIOS
  dmi.bios.version: Ubuntu-1.8.2-1ubuntu1
  dmi.chassis.type: 1
  dmi.chassis.vendor: QEMU
  dmi.chassis.version: pc-i440fx-wily
  dmi.modalias: dmi:bvnSeaBIOS:bvrUbuntu-1.8.2-1ubuntu1:bd04/01/2014:svnQEMU:pnStandardPC(i440FX+PIIX,1996):pvrpc-i440fx-wily:cvnQEMU:ct1:cvrpc-i440fx-wily:
  dmi.product.name: Standard PC (i440FX + PIIX, 1996)
  dmi.product.version: pc-i440fx-wily
  dmi.sys.vendor: QEMU

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/1570195/+subscriptions

