kernel-packages team mailing list archive

[Bug 1473883] Re: Kernel panics on mlx4_core (Mellanox Core driver) with SR-IOV mode

 

Sent patches to the k-team ML for Vivid.
Since these are in 4.1, they should be picked up in Wily when we rebase.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1473883

Title:
  Kernel panics on mlx4_core (Mellanox Core driver) with SR-IOV mode

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Vivid:
  In Progress

Bug description:
  SRU Justification:

  [Impact]

  Loading and unloading mlx4_core twice with SR-IOV mode enabled, on a host
  with multiple Mellanox devices (some of which support SR-IOV and others
  that don't), leads to a kernel panic.

  [Fix]

  commit 5114a04e6c73a0c6e74737e801b8a3b3d40c7e36
  commit ed3d2276ef72be23c6367358d80004130d8c797d

  $ git describe 5114a04e6c73a0c6e74737e801b8a3b3d40c7e36 ed3d2276ef72be23c6367358d80004130d8c797d
  v4.1-rc6-1067-g5114a04
  v4.1-rc6-1068-ged3d227
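
For readers less familiar with the `git describe` output above: each string has the form <tag>-<count>-g<abbrev>, that is, the nearest tag reachable from the commit, the number of commits on top of that tag, and the abbreviated commit hash prefixed with "g". The following is a minimal sketch that splits such a string with plain POSIX parameter expansion; `parse_describe` is a hypothetical helper name, not something taken from the bug report.

```shell
#!/bin/sh
# Split a `git describe` string of the form <tag>-<count>-g<abbrev>
# into its three fields. parse_describe is a hypothetical helper name.
parse_describe() {
    desc=$1
    abbrev=${desc##*-g}     # abbreviated commit hash after the final "-g"
    rest=${desc%-g*}        # what remains: "<tag>-<count>"
    count=${rest##*-}       # number of commits on top of the tag
    tag=${rest%-*}          # nearest reachable tag
    echo "$tag $count $abbrev"
}

parse_describe 'v4.1-rc6-1067-g5114a04'   # prints: v4.1-rc6 1067 5114a04
parse_describe 'v4.1-rc6-1068-ged3d227'   # prints: v4.1-rc6 1068 ed3d227
```

Read this way, both fixes are described relative to the v4.1-rc6 tag, 1067 and 1068 commits past it respectively.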

  [Test Case]

  1- Add the line "options mlx4_core num_vfs=60 port_type_array=2,2" to the /etc/modprobe.d/mlx4_core.conf file.
  2- Unload the mlx4_* kernel modules: modprobe -rv mlx4_en; modprobe -rv mlx4_ib; modprobe -rv mlx4_core;
  3- Load the mlx4_en kernel module: modprobe -v mlx4_en
  4- Edit /etc/modprobe.d/mlx4_core.conf and comment out the "options mlx4_core num_vfs=60 port_type_array=2,2" line.
  5- Repeat steps 2 and 3.
  6- The call trace shown further down appears in the kernel log.
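
The test case steps could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes the config file contains nothing but the mlx4_core options line (so it can be overwritten), the function names and the DRY_RUN switch are hypothetical, and by default it only prints the commands. Since the expected result is a kernel panic, running it with DRY_RUN=0 (as root) only makes sense on a disposable test host.

```shell
#!/bin/sh
# Sketch of the [Test Case] steps. Prints the commands by default;
# set DRY_RUN=0 and run as root on a disposable test host to execute them.
# Assumption: CONF holds only the mlx4_core options line, so it is overwritten.
CONF="${CONF:-/etc/modprobe.d/mlx4_core.conf}"
OPTS='options mlx4_core num_vfs=60 port_type_array=2,2'
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or only print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Replace the contents of $CONF with the single line given as $1.
write_conf() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ write \"$1\" to $CONF"
    else
        printf '%s\n' "$1" > "$CONF"
    fi
}

# Steps 2 and 3: unload the mlx4 modules, then load mlx4_en again.
reload_mlx4() {
    run modprobe -rv mlx4_en
    run modprobe -rv mlx4_ib
    run modprobe -rv mlx4_core
    run modprobe -v mlx4_en
}

reproduce() {
    write_conf "$OPTS"      # step 1: request 60 VFs
    reload_mlx4             # steps 2-3
    write_conf "# $OPTS"    # step 4: comment the options line out
    reload_mlx4             # step 5: the reload after which the trace is reported
}

reproduce
```

In dry-run mode the script just lists the two unload/load cycles in order, which makes it easy to compare against the numbered steps before running anything for real.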

  --

  Loading and unloading mlx4_core twice with SR-IOV mode enabled, on a host
  with multiple Mellanox devices (some of which support SR-IOV and others
  that don't), leads to a kernel panic.

  The following two upstream commits fix this issue:

  commit 32b4ca5af1cf1c558dfca0e3417e9b35402401a6
  Author: Carol L Soto <clsoto@xxxxxxxxxxxxxxxxxx>
  Date:   Tue Jun 2 16:07:23 2015 -0500

      net/mlx4_core: double free of dev_vfs

      If user loads mlx4_core with num_vfs greater than
      supported then variable dev->dev_vfs is freed 2 times after unloading the
      driver.

      Acked-by: Or Gerlitz <ogerlitz@xxxxxxxxxxxx>
      Signed-off-by: Carol L Soto <clsoto@xxxxxxxxxxxxxxxxxx>
      Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>

  commit 7095b39f3189d2107045d769fdc32dfc0b704028
  Author: Carol Soto <clsoto@xxxxxxxxxxxxxxxxxx>
  Date:   Tue Jun 2 16:07:24 2015 -0500

      net/mlx4_core: need to call close fw if alloc icm is called twice

      If mlx4_enable_sriov is called by adapter without this
      feature MLX4_DEV_CAP_FLAG2_SYS_EQS then during this path the function alloc
      icm is called twice without freeing the structures from the first time.

      Acked-by: Or Gerlitz <ogerlitz@xxxxxxxxxxxx>
      Signed-off-by: Carol L Soto <clsoto@xxxxxxxxxxxxxxxxxx>
      Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>

  Steps to reproduce:
  1- Add the line "options mlx4_core num_vfs=60 port_type_array=2,2" to the /etc/modprobe.d/mlx4_core.conf file.
  2- Unload the mlx4_* kernel modules: modprobe -rv mlx4_en; modprobe -rv mlx4_ib; modprobe -rv mlx4_core;
  3- Load the mlx4_en kernel module: modprobe -v mlx4_en
  4- Edit /etc/modprobe.d/mlx4_core.conf and comment out the "options mlx4_core num_vfs=60 port_type_array=2,2" line.
  5- Repeat steps 2 and 3.
  6- The following call trace appears in the kernel log.

  Call Trace:
  [ 1175.699487] mlx4_core 0000:24:00.0: Received reset from slave:7
  [ 1175.767388] mlx4_core 0000:24:00.0: Received reset from slave:6
  [ 1175.830898] mlx4_core 0000:24:00.0: Received reset from slave:5
  [ 1175.898229] mlx4_core 0000:24:00.0: Received reset from slave:4
  [ 1175.963514] mlx4_core 0000:24:00.0: Received reset from slave:3
  [ 1176.035312] mlx4_core 0000:24:00.0: Received reset from slave:2
  [ 1176.105085] mlx4_core 0000:24:00.0: Received reset from slave:1
  [ 1177.253200] mlx4_core 0000:24:00.0: Disabling SR-IOV
  [ 1179.724864] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
  [ 1179.724885] mlx4_core: Initializing 0000:21:00.0
  [ 1185.760555] mlx4_core 0000:21:00.0: Enabling SR-IOV with 60 VFs
  [ 1185.760575] mlx4_core 0000:21:00.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -22)
  [ 1185.770550] mlx4_core 0000:21:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
  [ 1185.770552] mlx4_core 0000:21:00.0: PCIe link width is x8, device supports x8
  [ 1185.771870] ------------[ cut here ]------------
  [ 1185.771878] WARNING: CPU: 6 PID: 5947 at /build/buildd/linux-3.19.0/fs/sysfs/dir.c:31 sysfs_warn_dup+0x68/0x80()
  [ 1185.771880] sysfs: cannot create duplicate filename '/devices/pci0000:20/0000:20:03.0/0000:21:00.0/msi_irqs/57'
  [ 1185.771881] Modules linked in: mlx4_core(+) vxlan ip6_udp_tunnel udp_tunnel mst_pciconf(OE) mst_pci(OE) nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper scsi_dh ablk_helper cryptd joydev lpc_ich serio_raw ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter ioatdma dca hpilo mac_hid wmi sb_edac edac_core shpchp nfsd auth_rpcgss
  [ 1185.771920]  nfs_acl lockd grace sunrpc autofs4 hid_generic usbhid tg3 pata_acpi ptp hid psmouse hpsa pps_core [last unloaded: ib_addr]
  [ 1185.771931] CPU: 6 PID: 5947 Comm: modprobe Tainted: G           OE  3.19.0-16-generic #16-Ubuntu
  [ 1185.771932] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 03/01/2013
  [ 1185.771934]  ffffffff81abb6d8 ffff88086cdb37c8 ffffffff817c2235 0000000000000007
  [ 1185.771936]  ffff88086cdb3818 ffff88086cdb3808 ffffffff8107595a 0000000000000292
  [ 1185.771938]  ffff88084d1ea000 ffff88086d1c1648 ffff8807b3df62d0 ffff880867ab85a0
  [ 1185.771941] Call Trace:
  [ 1185.771949]  [<ffffffff817c2235>] dump_stack+0x45/0x57
  [ 1185.771953]  [<ffffffff8107595a>] warn_slowpath_common+0x8a/0xc0
  [ 1185.771955]  [<ffffffff810759d6>] warn_slowpath_fmt+0x46/0x50
  [ 1185.771958]  [<ffffffff8126ab58>] ? kernfs_path+0x48/0x60
  [ 1185.771961]  [<ffffffff8126e508>] sysfs_warn_dup+0x68/0x80
  [ 1185.771963]  [<ffffffff8126e1ff>] sysfs_add_file_mode_ns+0x14f/0x1c0
  [ 1185.771966]  [<ffffffff8126c050>] ? kernfs_create_dir_ns+0x50/0x80
  [ 1185.771969]  [<ffffffff8126edf9>] internal_create_group+0xd9/0x280
  [ 1185.771971]  [<ffffffff8126f0d9>] sysfs_create_groups+0x49/0xa0
  [ 1185.771976]  [<ffffffff8141bfad>] populate_msi_sysfs+0x1bd/0x200
  [ 1185.771978]  [<ffffffff8141c4c8>] pci_enable_msix+0x158/0x3c0
  [ 1185.771980]  [<ffffffff8141c75d>] pci_enable_msix_range+0x2d/0x70
  [ 1185.771991]  [<ffffffffc0900245>] mlx4_load_one+0xea5/0x1410 [mlx4_core]
  [ 1185.771999]  [<ffffffffc0900c9b>] mlx4_init_one+0x4eb/0x600 [mlx4_core]
  [ 1185.772003]  [<ffffffff81401155>] local_pci_probe+0x45/0xa0
  [ 1185.772005]  [<ffffffff81402345>] ? pci_match_device+0xe5/0x110
  [ 1185.772007]  [<ffffffff81402489>] pci_device_probe+0xd9/0x130
  [ 1185.772012]  [<ffffffff81506523>] driver_probe_device+0xa3/0x410
  [ 1185.772014]  [<ffffffff8150696b>] __driver_attach+0x9b/0xa0
  [ 1185.772016]  [<ffffffff815068d0>] ? __device_attach+0x40/0x40
  [ 1185.772020]  [<ffffffff815042eb>] bus_for_each_dev+0x6b/0xb0
  [ 1185.772022]  [<ffffffff81505f8e>] driver_attach+0x1e/0x20
  [ 1185.772024]  [<ffffffff81505b60>] bus_add_driver+0x180/0x250
  [ 1185.772027]  [<ffffffffc0344000>] ? 0xffffffffc0344000
  [ 1185.772030]  [<ffffffff81507164>] driver_register+0x64/0xf0
  [ 1185.772034]  [<ffffffff8140098c>] __pci_register_driver+0x4c/0x50
  [ 1185.772042]  [<ffffffffc0344126>] mlx4_init+0x126/0x1000 [mlx4_core]
  [ 1185.772047]  [<ffffffff81002148>] do_one_initcall+0xd8/0x210
  [ 1185.772053]  [<ffffffff811d5b49>] ? kmem_cache_alloc_trace+0x189/0x200
  [ 1185.772058]  [<ffffffff810f99c4>] ? load_module+0x15a4/0x1ce0
  [ 1185.772061]  [<ffffffff810f99fe>] load_module+0x15de/0x1ce0
  [ 1185.772063]  [<ffffffff810f51d0>] ? store_uevent+0x40/0x40
  [ 1185.772067]  [<ffffffff810fa276>] SyS_finit_module+0x86/0xb0
  [ 1185.772072]  [<ffffffff817c934d>] system_call_fastpath+0x16/0x1b
  [ 1185.772074] ---[ end trace 9d9c0896e72e5312 ]---
  [ 1185.873139] mlx4_core 0000:21:00.0: command 0x31 timed out (go bit not cleared)
  [ 1185.873147] mlx4_core 0000:21:00.0: device is going to be reset
  [ 1186.881239] mlx4_core 0000:21:00.0: device was reset successfully
  [ 1186.888006] mlx4_core 0000:21:00.0: NOP command failed to generate interrupt (IRQ 53), aborting
  [ 1186.897831] mlx4_core 0000:21:00.0: BIOS or ACPI interrupt routing problem?
  [ 1186.907762] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
  [ 1186.916462] IP: [<ffffffff81181185>] __free_pages+0x5/0x30
  [ 1186.922560] PGD 0
  [ 1186.924814] Oops: 0002 [#1] SMP
  [ 1186.928423] Modules linked in: mlx4_core(+) vxlan ip6_udp_tunnel udp_tunnel mst_pciconf(OE) mst_pci(OE) nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper scsi_dh ablk_helper cryptd joydev lpc_ich serio_raw ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter ioatdma dca hpilo mac_hid wmi sb_edac edac_core shpchp nfsd auth_rpcgss
  [ 1187.008078]  nfs_acl lockd grace sunrpc autofs4 hid_generic usbhid tg3 pata_acpi ptp hid psmouse hpsa pps_core [last unloaded: ib_addr]
  [ 1187.020643] CPU: 8 PID: 5947 Comm: modprobe Tainted: G        W  OE  3.19.0-16-generic #16-Ubuntu
  [ 1187.030455] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 03/01/2013
  [ 1187.037778] task: ffff88079d6cb110 ti: ffff88086cdb0000 task.ti: ffff88086cdb0000
  [ 1187.046064] RIP: 0010:[<ffffffff81181185>]  [<ffffffff81181185>] __free_pages+0x5/0x30
  [ 1187.054859] RSP: 0018:ffff88086cdb39a0  EFLAGS: 00010206
  [ 1187.060730] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000000
  [ 1187.068610] RDX: 00000000000ffff8 RSI: 0000000000000014 RDI: 0000000000000000
  [ 1187.076492] RBP: ffff88086cdb39e8 R08: 0000000000000040 R09: 0000000000000000
  [ 1187.084374] R10: 0000000000000040 R11: ffff88079bbf6000 R12: ffff8807b3e20000
  [ 1187.092253] R13: ffff88086921a420 R14: ffff88086921a400 R15: 0000000000000001
  [ 1187.100139] FS:  00007fadaa1b9700(0000) GS:ffff88087f840000(0000) knlGS:0000000000000000
  [ 1187.109092] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1187.115445] CR2: 000000000000001c CR3: 0000000823f6f000 CR4: 00000000000407e0
  [ 1187.123336] Stack:
  [ 1187.125570]  ffffffffc08f9d9f 0000000000000099 ffff88086921a3e0 ffff88086cdb39e8
  [ 1187.133802]  0000000000000099 ffff8807b3e20000 ffff8807b3e23268 0000000000000099
  [ 1187.142030]  ffff8807b3e20000 ffff88086cdb3a18 ffffffffc08fab7c ffff8807b3e20000
  [ 1187.150264] Call Trace:
  [ 1187.153003]  [<ffffffffc08f9d9f>] ? mlx4_free_icm+0x17f/0x1d0 [mlx4_core]
  [ 1187.160526]  [<ffffffffc08fab7c>] mlx4_cleanup_icm_table+0x5c/0x80 [mlx4_core]
  [ 1187.168537]  [<ffffffffc08fb5bd>] mlx4_free_icms+0x1d/0x100 [mlx4_core]
  [ 1187.175849]  [<ffffffffc08fba8b>] mlx4_close_hca+0x4b/0x70 [mlx4_core]
  [ 1187.183072]  [<ffffffffc08ff943>] mlx4_load_one+0x5a3/0x1410 [mlx4_core]
  [ 1187.190480]  [<ffffffffc0900c9b>] mlx4_init_one+0x4eb/0x600 [mlx4_core]
  [ 1187.197786]  [<ffffffff81401155>] local_pci_probe+0x45/0xa0
  [ 1187.203944]  [<ffffffff81402345>] ? pci_match_device+0xe5/0x110
  [ 1187.210485]  [<ffffffff81402489>] pci_device_probe+0xd9/0x130
  [ 1187.216842]  [<ffffffff81506523>] driver_probe_device+0xa3/0x410
  [ 1187.223478]  [<ffffffff8150696b>] __driver_attach+0x9b/0xa0
  [ 1187.229643]  [<ffffffff815068d0>] ? __device_attach+0x40/0x40
  [ 1187.236002]  [<ffffffff815042eb>] bus_for_each_dev+0x6b/0xb0
  [ 1187.242256]  [<ffffffff81505f8e>] driver_attach+0x1e/0x20
  [ 1187.248222]  [<ffffffff81505b60>] bus_add_driver+0x180/0x250
  [ 1187.254479]  [<ffffffffc0344000>] ? 0xffffffffc0344000
  [ 1187.260158]  [<ffffffff81507164>] driver_register+0x64/0xf0
  [ 1187.266334]  [<ffffffff8140098c>] __pci_register_driver+0x4c/0x50
  [ 1187.273077]  [<ffffffffc0344126>] mlx4_init+0x126/0x1000 [mlx4_core]
  [ 1187.280112]  [<ffffffff81002148>] do_one_initcall+0xd8/0x210
  [ 1187.286383]  [<ffffffff811d5b49>] ? kmem_cache_alloc_trace+0x189/0x200
  [ 1187.293753]  [<ffffffff810f99c4>] ? load_module+0x15a4/0x1ce0
  [ 1187.300109]  [<ffffffff810f99fe>] load_module+0x15de/0x1ce0
  [ 1187.306271]  [<ffffffff810f51d0>] ? store_uevent+0x40/0x40
  [ 1187.312333]  [<ffffffff810fa276>] SyS_finit_module+0x86/0xb0
  [ 1187.318595]  [<ffffffff817c934d>] system_call_fastpath+0x16/0x1b
  [ 1187.325233] Code: 74 1c 48 8b 03 90 48 8b 7b 08 48 83 c3 10 44 89 ea 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 e8 eb a6 66 0f 1f 44 00 00 66 66 66 66 90 <f0> ff 4f 1c 74 05 c3 0f 1f 40 00 55 85 f6 48 89 e5 74 08 e8 d3
  [ 1187.346856] RIP  [<ffffffff81181185>] __free_pages+0x5/0x30
  [ 1187.353034]  RSP <ffff88086cdb39a0>
  [ 1187.356900] CR2: 000000000000001c
  [ 1187.361080] ---[ end trace 9d9c0896e72e5313 ]---

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1473883/+subscriptions

