kernel-packages team mailing list archive
Message #129914
[Bug 1473883] Re: Kernel panics on mlx4_core (Mellanox Core driver) with SR-IOV mode
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-vivid' to 'verification-done-vivid'.
If verification is not done within 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Thank you!
** Tags added: verification-needed-vivid
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1473883
Title:
Kernel panics on mlx4_core (Mellanox Core driver) with SR-IOV mode
Status in linux package in Ubuntu:
Triaged
Status in linux source package in Vivid:
Fix Committed
Bug description:
SRU Justification:
[Impact]
Loading and unloading mlx4_core twice with SR-IOV mode enabled, on a host
with multiple Mellanox devices (some of which support SR-IOV and others
do not), leads to a kernel panic.
[Fix]
commit 5114a04e6c73a0c6e74737e801b8a3b3d40c7e36
commit ed3d2276ef72be23c6367358d80004130d8c797d
$ git describe 5114a04e6c73a0c6e74737e801b8a3b3d40c7e36 ed3d2276ef72be23c6367358d80004130d8c797d
v4.1-rc6-1067-g5114a04
v4.1-rc6-1068-ged3d227
[Test Case]
1- add the line "options mlx4_core num_vfs=60 port_type_array=2,2" to the /etc/modprobe.d/mlx4_core.conf file.
2- unload the mlx4_* kernel modules: modprobe -rv mlx4_en; modprobe -rv mlx4_ib; modprobe -rv mlx4_core;
3- load the mlx4_en kernel module: modprobe -v mlx4_en
4- edit the /etc/modprobe.d/mlx4_core.conf file and comment out the line "options mlx4_core num_vfs=60 port_type_array=2,2".
5- repeat steps 2 and 3.
6- the kernel panics with the call trace shown in the bug description.
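The steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the helper names (run, reload_mlx4, reproduce) and the RUN and CONF variables are hypothetical and not part of the bug report, and the script only prints the commands unless RUN=1 is set, since the real sequence needs root and is expected to panic an affected kernel.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Dry run by default: commands are only printed.
# CONF, RUN and the function names are illustrative assumptions.
CONF="${CONF:-/etc/modprobe.d/mlx4_core.conf}"
OPTS='options mlx4_core num_vfs=60 port_type_array=2,2'

# Execute a command when RUN=1, otherwise print it prefixed with "+".
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Steps 2 and 3: unload the mlx4 modules, then load mlx4_en again.
reload_mlx4() {
    run modprobe -rv mlx4_en
    run modprobe -rv mlx4_ib
    run modprobe -rv mlx4_core
    run modprobe -v mlx4_en
}

reproduce() {
    # Step 1: append the SR-IOV module options to the modprobe config file.
    run sh -c "echo '$OPTS' >> '$CONF'"
    reload_mlx4
    # Step 4: comment the options line out (sed -i as provided by GNU sed).
    run sed -i 's/^options mlx4_core /# &/' "$CONF"
    # Step 5: repeat steps 2 and 3; affected kernels are expected to panic here.
    reload_mlx4
}
```

Calling reproduce prints the ten commands in order; running it as root with RUN=1 would execute them, which should only be done on a test machine.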
--
Loading and unloading mlx4_core twice with SR-IOV mode enabled, on a host
with multiple Mellanox devices (some of which support SR-IOV and others
do not), leads to a kernel panic.
The following two upstream commits fix this issue:
commit 32b4ca5af1cf1c558dfca0e3417e9b35402401a6
Author: Carol L Soto <clsoto@xxxxxxxxxxxxxxxxxx>
Date: Tue Jun 2 16:07:23 2015 -0500
net/mlx4_core: double free of dev_vfs
If user loads mlx4_core with num_vfs greater than
supported then variable dev->dev_vfs is freed 2 times after unloading the
driver.
Acked-by: Or Gerlitz <ogerlitz@xxxxxxxxxxxx>
Signed-off-by: Carol L Soto <clsoto@xxxxxxxxxxxxxxxxxx>
Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
commit 7095b39f3189d2107045d769fdc32dfc0b704028
Author: Carol Soto <clsoto@xxxxxxxxxxxxxxxxxx>
Date: Tue Jun 2 16:07:24 2015 -0500
net/mlx4_core: need to call close fw if alloc icm is called twice
If mlx4_enable_sriov is called by adapter without this
feature MLX4_DEV_CAP_FLAG2_SYS_EQS then during this path the function alloc
icm is called twice without freeing the structures from the first time.
Acked-by: Or Gerlitz <ogerlitz@xxxxxxxxxxxx>
Signed-off-by: Carol L Soto <clsoto@xxxxxxxxxxxxxxxxxx>
Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
Steps to reproduce:
1- add the line "options mlx4_core num_vfs=60 port_type_array=2,2" to the /etc/modprobe.d/mlx4_core.conf file.
2- unload the mlx4_* kernel modules: modprobe -rv mlx4_en; modprobe -rv mlx4_ib; modprobe -rv mlx4_core;
3- load the mlx4_en kernel module: modprobe -v mlx4_en
4- edit the /etc/modprobe.d/mlx4_core.conf file and comment out the line "options mlx4_core num_vfs=60 port_type_array=2,2".
5- repeat steps 2 and 3.
6- the kernel panics with the following call trace.
Call Trace:
[ 1175.699487] mlx4_core 0000:24:00.0: Received reset from slave:7
[ 1175.767388] mlx4_core 0000:24:00.0: Received reset from slave:6
[ 1175.830898] mlx4_core 0000:24:00.0: Received reset from slave:5
[ 1175.898229] mlx4_core 0000:24:00.0: Received reset from slave:4
[ 1175.963514] mlx4_core 0000:24:00.0: Received reset from slave:3
[ 1176.035312] mlx4_core 0000:24:00.0: Received reset from slave:2
[ 1176.105085] mlx4_core 0000:24:00.0: Received reset from slave:1
[ 1177.253200] mlx4_core 0000:24:00.0: Disabling SR-IOV
[ 1179.724864] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[ 1179.724885] mlx4_core: Initializing 0000:21:00.0
[ 1185.760555] mlx4_core 0000:21:00.0: Enabling SR-IOV with 60 VFs
[ 1185.760575] mlx4_core 0000:21:00.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -22)
[ 1185.770550] mlx4_core 0000:21:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 1185.770552] mlx4_core 0000:21:00.0: PCIe link width is x8, device supports x8
[ 1185.771870] ------------[ cut here ]------------
[ 1185.771878] WARNING: CPU: 6 PID: 5947 at /build/buildd/linux-3.19.0/fs/sysfs/dir.c:31 sysfs_warn_dup+0x68/0x80()
[ 1185.771880] sysfs: cannot create duplicate filename '/devices/pci0000:20/0000:20:03.0/0000:21:00.0/msi_irqs/57'
[ 1185.771881] Modules linked in: mlx4_core(+) vxlan ip6_udp_tunnel udp_tunnel mst_pciconf(OE) mst_pci(OE) nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper scsi_dh ablk_helper cryptd joydev lpc_ich serio_raw ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter ioatdma dca hpilo mac_hid wmi sb_edac edac_core shpchp nfsd auth_rpcgss
[ 1185.771920] nfs_acl lockd grace sunrpc autofs4 hid_generic usbhid tg3 pata_acpi ptp hid psmouse hpsa pps_core [last unloaded: ib_addr]
[ 1185.771931] CPU: 6 PID: 5947 Comm: modprobe Tainted: G OE 3.19.0-16-generic #16-Ubuntu
[ 1185.771932] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 03/01/2013
[ 1185.771934] ffffffff81abb6d8 ffff88086cdb37c8 ffffffff817c2235 0000000000000007
[ 1185.771936] ffff88086cdb3818 ffff88086cdb3808 ffffffff8107595a 0000000000000292
[ 1185.771938] ffff88084d1ea000 ffff88086d1c1648 ffff8807b3df62d0 ffff880867ab85a0
[ 1185.771941] Call Trace:
[ 1185.771949] [<ffffffff817c2235>] dump_stack+0x45/0x57
[ 1185.771953] [<ffffffff8107595a>] warn_slowpath_common+0x8a/0xc0
[ 1185.771955] [<ffffffff810759d6>] warn_slowpath_fmt+0x46/0x50
[ 1185.771958] [<ffffffff8126ab58>] ? kernfs_path+0x48/0x60
[ 1185.771961] [<ffffffff8126e508>] sysfs_warn_dup+0x68/0x80
[ 1185.771963] [<ffffffff8126e1ff>] sysfs_add_file_mode_ns+0x14f/0x1c0
[ 1185.771966] [<ffffffff8126c050>] ? kernfs_create_dir_ns+0x50/0x80
[ 1185.771969] [<ffffffff8126edf9>] internal_create_group+0xd9/0x280
[ 1185.771971] [<ffffffff8126f0d9>] sysfs_create_groups+0x49/0xa0
[ 1185.771976] [<ffffffff8141bfad>] populate_msi_sysfs+0x1bd/0x200
[ 1185.771978] [<ffffffff8141c4c8>] pci_enable_msix+0x158/0x3c0
[ 1185.771980] [<ffffffff8141c75d>] pci_enable_msix_range+0x2d/0x70
[ 1185.771991] [<ffffffffc0900245>] mlx4_load_one+0xea5/0x1410 [mlx4_core]
[ 1185.771999] [<ffffffffc0900c9b>] mlx4_init_one+0x4eb/0x600 [mlx4_core]
[ 1185.772003] [<ffffffff81401155>] local_pci_probe+0x45/0xa0
[ 1185.772005] [<ffffffff81402345>] ? pci_match_device+0xe5/0x110
[ 1185.772007] [<ffffffff81402489>] pci_device_probe+0xd9/0x130
[ 1185.772012] [<ffffffff81506523>] driver_probe_device+0xa3/0x410
[ 1185.772014] [<ffffffff8150696b>] __driver_attach+0x9b/0xa0
[ 1185.772016] [<ffffffff815068d0>] ? __device_attach+0x40/0x40
[ 1185.772020] [<ffffffff815042eb>] bus_for_each_dev+0x6b/0xb0
[ 1185.772022] [<ffffffff81505f8e>] driver_attach+0x1e/0x20
[ 1185.772024] [<ffffffff81505b60>] bus_add_driver+0x180/0x250
[ 1185.772027] [<ffffffffc0344000>] ? 0xffffffffc0344000
[ 1185.772030] [<ffffffff81507164>] driver_register+0x64/0xf0
[ 1185.772034] [<ffffffff8140098c>] __pci_register_driver+0x4c/0x50
[ 1185.772042] [<ffffffffc0344126>] mlx4_init+0x126/0x1000 [mlx4_core]
[ 1185.772047] [<ffffffff81002148>] do_one_initcall+0xd8/0x210
[ 1185.772053] [<ffffffff811d5b49>] ? kmem_cache_alloc_trace+0x189/0x200
[ 1185.772058] [<ffffffff810f99c4>] ? load_module+0x15a4/0x1ce0
[ 1185.772061] [<ffffffff810f99fe>] load_module+0x15de/0x1ce0
[ 1185.772063] [<ffffffff810f51d0>] ? store_uevent+0x40/0x40
[ 1185.772067] [<ffffffff810fa276>] SyS_finit_module+0x86/0xb0
[ 1185.772072] [<ffffffff817c934d>] system_call_fastpath+0x16/0x1b
[ 1185.772074] ---[ end trace 9d9c0896e72e5312 ]---
[ 1185.873139] mlx4_core 0000:21:00.0: command 0x31 timed out (go bit not cleared)
[ 1185.873147] mlx4_core 0000:21:00.0: device is going to be reset
[ 1186.881239] mlx4_core 0000:21:00.0: device was reset successfully
[ 1186.888006] mlx4_core 0000:21:00.0: NOP command failed to generate interrupt (IRQ 53), aborting
[ 1186.897831] mlx4_core 0000:21:00.0: BIOS or ACPI interrupt routing problem?
[ 1186.907762] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 1186.916462] IP: [<ffffffff81181185>] __free_pages+0x5/0x30
[ 1186.922560] PGD 0
[ 1186.924814] Oops: 0002 [#1] SMP
[ 1186.928423] Modules linked in: mlx4_core(+) vxlan ip6_udp_tunnel udp_tunnel mst_pciconf(OE) mst_pci(OE) nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper scsi_dh ablk_helper cryptd joydev lpc_ich serio_raw ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter ioatdma dca hpilo mac_hid wmi sb_edac edac_core shpchp nfsd auth_rpcgss
[ 1187.008078] nfs_acl lockd grace sunrpc autofs4 hid_generic usbhid tg3 pata_acpi ptp hid psmouse hpsa pps_core [last unloaded: ib_addr]
[ 1187.020643] CPU: 8 PID: 5947 Comm: modprobe Tainted: G W OE 3.19.0-16-generic #16-Ubuntu
[ 1187.030455] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 03/01/2013
[ 1187.037778] task: ffff88079d6cb110 ti: ffff88086cdb0000 task.ti: ffff88086cdb0000
[ 1187.046064] RIP: 0010:[<ffffffff81181185>] [<ffffffff81181185>] __free_pages+0x5/0x30
[ 1187.054859] RSP: 0018:ffff88086cdb39a0 EFLAGS: 00010206
[ 1187.060730] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000000
[ 1187.068610] RDX: 00000000000ffff8 RSI: 0000000000000014 RDI: 0000000000000000
[ 1187.076492] RBP: ffff88086cdb39e8 R08: 0000000000000040 R09: 0000000000000000
[ 1187.084374] R10: 0000000000000040 R11: ffff88079bbf6000 R12: ffff8807b3e20000
[ 1187.092253] R13: ffff88086921a420 R14: ffff88086921a400 R15: 0000000000000001
[ 1187.100139] FS: 00007fadaa1b9700(0000) GS:ffff88087f840000(0000) knlGS:0000000000000000
[ 1187.109092] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1187.115445] CR2: 000000000000001c CR3: 0000000823f6f000 CR4: 00000000000407e0
[ 1187.123336] Stack:
[ 1187.125570] ffffffffc08f9d9f 0000000000000099 ffff88086921a3e0 ffff88086cdb39e8
[ 1187.133802] 0000000000000099 ffff8807b3e20000 ffff8807b3e23268 0000000000000099
[ 1187.142030] ffff8807b3e20000 ffff88086cdb3a18 ffffffffc08fab7c ffff8807b3e20000
[ 1187.150264] Call Trace:
[ 1187.153003] [<ffffffffc08f9d9f>] ? mlx4_free_icm+0x17f/0x1d0 [mlx4_core]
[ 1187.160526] [<ffffffffc08fab7c>] mlx4_cleanup_icm_table+0x5c/0x80 [mlx4_core]
[ 1187.168537] [<ffffffffc08fb5bd>] mlx4_free_icms+0x1d/0x100 [mlx4_core]
[ 1187.175849] [<ffffffffc08fba8b>] mlx4_close_hca+0x4b/0x70 [mlx4_core]
[ 1187.183072] [<ffffffffc08ff943>] mlx4_load_one+0x5a3/0x1410 [mlx4_core]
[ 1187.190480] [<ffffffffc0900c9b>] mlx4_init_one+0x4eb/0x600 [mlx4_core]
[ 1187.197786] [<ffffffff81401155>] local_pci_probe+0x45/0xa0
[ 1187.203944] [<ffffffff81402345>] ? pci_match_device+0xe5/0x110
[ 1187.210485] [<ffffffff81402489>] pci_device_probe+0xd9/0x130
[ 1187.216842] [<ffffffff81506523>] driver_probe_device+0xa3/0x410
[ 1187.223478] [<ffffffff8150696b>] __driver_attach+0x9b/0xa0
[ 1187.229643] [<ffffffff815068d0>] ? __device_attach+0x40/0x40
[ 1187.236002] [<ffffffff815042eb>] bus_for_each_dev+0x6b/0xb0
[ 1187.242256] [<ffffffff81505f8e>] driver_attach+0x1e/0x20
[ 1187.248222] [<ffffffff81505b60>] bus_add_driver+0x180/0x250
[ 1187.254479] [<ffffffffc0344000>] ? 0xffffffffc0344000
[ 1187.260158] [<ffffffff81507164>] driver_register+0x64/0xf0
[ 1187.266334] [<ffffffff8140098c>] __pci_register_driver+0x4c/0x50
[ 1187.273077] [<ffffffffc0344126>] mlx4_init+0x126/0x1000 [mlx4_core]
[ 1187.280112] [<ffffffff81002148>] do_one_initcall+0xd8/0x210
[ 1187.286383] [<ffffffff811d5b49>] ? kmem_cache_alloc_trace+0x189/0x200
[ 1187.293753] [<ffffffff810f99c4>] ? load_module+0x15a4/0x1ce0
[ 1187.300109] [<ffffffff810f99fe>] load_module+0x15de/0x1ce0
[ 1187.306271] [<ffffffff810f51d0>] ? store_uevent+0x40/0x40
[ 1187.312333] [<ffffffff810fa276>] SyS_finit_module+0x86/0xb0
[ 1187.318595] [<ffffffff817c934d>] system_call_fastpath+0x16/0x1b
[ 1187.325233] Code: 74 1c 48 8b 03 90 48 8b 7b 08 48 83 c3 10 44 89 ea 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 e8 eb a6 66 0f 1f 44 00 00 66 66 66 66 90 <f0> ff 4f 1c 74 05 c3 0f 1f 40 00 55 85 f6 48 89 e5 74 08 e8 d3
[ 1187.346856] RIP [<ffffffff81181185>] __free_pages+0x5/0x30
[ 1187.353034] RSP <ffff88086cdb39a0>
[ 1187.356900] CR2: 000000000000001c
[ 1187.361080] ---[ end trace 9d9c0896e72e5313 ]---
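When verifying a candidate kernel, the log can be scanned for the failure signatures seen in the trace above. The following is a minimal sketch, assuming the kernel log is supplied on standard input; the function name check_mlx4_log is hypothetical, and the signature strings are copied from this report rather than from any authoritative list.

```shell
#!/bin/sh
# Sketch: read kernel log text on stdin and report the failure signatures
# quoted in this bug report. Returns 1 if any signature is found, else 0.
# The function name is an illustrative assumption.
check_mlx4_log() {
    log=$(cat)
    status=0
    for sig in \
        'sysfs: cannot create duplicate filename' \
        'BUG: unable to handle kernel NULL pointer dereference' \
        'Oops:'
    do
        # Plain substring match, so no regular expression quoting is needed.
        case $log in
            *"$sig"*)
                echo "found: $sig"
                status=1
                ;;
        esac
    done
    return "$status"
}
```

For example, after repeating the reproduction steps, running "dmesg | check_mlx4_log && echo clean" would print "clean" only when none of the signatures is present.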
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1473883/+subscriptions