kernel-packages team mailing list archive
Message #104369
[Bug 1422481] [NEW] mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)
You have been subscribed to a public bug:
---Problem Description---
EEH recovery is not working with the mlx4 driver. When the driver recovers, it hits another EEH error.
---uname output---
Linux ubuntu 3.18.0-12-generic #13 SMP Mon Feb 9 16:31:42 CST 2015 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
Requires a Mellanox adapter such as a ConnectX-3.
Machine Type = P8
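The EEH log in this report identifies the adapter only by raw config-space words: `PCI device/vendor: 100715b3` and `PCI cmd/status register: 00100146` (later `00100142`). Below is a minimal Python sketch of how those words split up. It assumes the standard PCI layout, with the vendor ID in the low 16 bits of the first word and the command register in the low 16 bits of the second. The helper names are illustrative and not part of any kernel tool.

```python
# Command-register bit names from the PCI specification (subset only).
COMMAND_BITS = {
    0: "io_space",
    1: "memory_space",
    2: "bus_master",
    6: "parity_error_response",
    8: "serr_enable",
    10: "interrupt_disable",
}


def split_dword(word_hex):
    """Split a 32-bit hex string into (high 16 bits, low 16 bits)."""
    value = int(word_hex, 16)
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def command_flags(command):
    """Return the names of the known command bits that are set."""
    return [name for bit, name in sorted(COMMAND_BITS.items())
            if command & (1 << bit)]


# "PCI device/vendor" word: device ID in the high half, vendor ID in the low half.
device_id, vendor_id = split_dword("100715b3")
print(f"vendor=0x{vendor_id:04x} device=0x{device_id:04x}")

# "PCI cmd/status" word: status in the high half, command in the low half.
for raw in ("00100146", "00100142"):
    status, command = split_dword(raw)
    print(raw, f"status=0x{status:04x}", command_flags(command))
```

Vendor ID 0x15b3 is Mellanox. The only command-register difference between the two dumps is the bus-master bit, which is set in the first dump and clear in the second.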
---Steps to Reproduce---
Inject an EEH error into the mlx4 device.
Stack trace output:
The driver starts EEH recovery, then hits this:
[ 188.747571] EEH: Collect temporary log
[ 188.748330] EEH: of node=/pci@800000020000007/ethernet@3
[ 188.748339] EEH: PCI device/vendor: 100715b3
[ 188.748361] EEH: PCI cmd/status register: 00100146
[ 188.748362] EEH: PCI-E capabilities and status follow:
[ 188.748459] EEH: PCI-E 00: 00020010 10008e02 0001200e 0843f483
[ 188.748537] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 188.748539] EEH: PCI-E 20: 00000000
[ 188.748540] EEH: PCI-E AER capability register set follows:
[ 188.748625] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 188.748704] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 188.748783] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 188.748805] EEH: PCI-E AER 30: 00000000 00000000
[ 188.748813] EEH: Reset without hotplug activity
[ 193.833245] EEH: Notify device drivers the completion of reset
[ 193.833257] mlx4_core: Initializing 0001:00:03.0
[ 193.833317] mlx4_core 0001:00:03.0: BAR 0: can't reserve [mem 0x170b0000000-0x170b00fffff]
[ 193.833321] mlx4_core 0001:00:03.0: Couldn't get PCI resources, aborting
[ 193.833395] EEH: Not recovered
[ 193.833397] EEH: Unable to recover from failure from PHB#1-PE#1.
Please try reseating or replacing it
[ 193.834531] EEH: of node=/pci@800000020000007/ethernet@3
[ 193.834547] EEH: PCI device/vendor: 100715b3
[ 193.834580] EEH: PCI cmd/status register: 00100142
[ 193.834582] EEH: PCI-E capabilities and status follow:
[ 193.834728] EEH: PCI-E 00: 00020010 10008e02 0000200e 0843f483
[ 193.834846] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 193.834849] EEH: PCI-E 20: 00000000
[ 193.834850] EEH: PCI-E AER capability register set follows:
[ 193.834981] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 193.835101] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 193.835219] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 193.835252] EEH: PCI-E AER 30: 00000000 00000000
[ 193.835289] Unable to handle kernel paging request for data at address 0x00000388
[ 193.835356] Faulting instruction address: 0xd000000001f3231c
[ 193.835415] Oops: Kernel access of bad area, sig: 11 [#1]
[ 193.835460] SMP NR_CPUS=2048 NUMA pSeries
[ 193.835509] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc rtc_generic mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core
[ 193.835886] CPU: 6 PID: 50 Comm: eehd Not tainted 3.18.0-12-generic #13
[ 193.835942] task: c0000003f72ca880 ti: c0000003f707c000 task.ti: c0000003f707c000
[ 193.836009] NIP: d000000001f3231c LR: d000000001f32790 CTR: d000000001f32760
[ 193.836076] REGS: c0000003f707f790 TRAP: 0300 Not tainted (3.18.0-12-generic)
[ 193.836141] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44000048 XER: 20000000
[ 193.836302] CFAR: c0000000000a7be0 DAR: 0000000000000388 DSISR: 40000000 SOFTE: 1
GPR00: d000000001f32790 c0000003f707fa10 d000000001f66310 c0000003fe0ad000
GPR04: 0000000000000003 0000000000000000 0000000000000000 c0000003fd000000
GPR08: 0000000000000001 d000000001f32760 00000000fffffffa 0000000100001001
GPR12: d000000001f32760 c00000000fb83600 c0000000000d9118 c0000003f90e56c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c4ab90
GPR24: c000000000c4ab68 0000000000100100 c0000003fe068580 c0000003fe068580
GPR28: c0000003fe0ad000 c0000003fe0685e0 d000000001f5da50 0000000000000000
[ 193.837205] NIP [d000000001f3231c] mlx4_unload_one+0x3c/0x480 [mlx4_core]
[ 193.837269] LR [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
[ 193.837336] Call Trace:
[ 193.837361] [c0000003f707fa10] [c0000003fe068580] 0xc0000003fe068580 (unreliable)
[ 193.837447] [c0000003f707faa0] [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
[ 193.837528] [c0000003f707fae0] [c00000000003ac64] eeh_report_failure+0xb4/0xf0
[ 193.837606] [c0000003f707fb10] [c0000000000393b4] eeh_pe_dev_traverse+0x94/0x160
[ 193.837685] [c0000003f707fba0] [c00000000003b148] eeh_handle_normal_event+0xa8/0x400
[ 193.837764] [c0000003f707fc20] [c00000000003b6b4] eeh_handle_event+0x54/0x360
[ 193.837832] [c0000003f707fcd0] [c00000000003bae4] eeh_event_handler+0x124/0x1d0
[ 193.837911] [c0000003f707fd80] [c0000000000d9220] kthread+0x110/0x130
[ 193.837980] [c0000003f707fe30] [c000000000009568] ret_from_kernel_thread+0x5c/0x74
[ 193.838057] Instruction dump:
[ 193.838094] fb41ffd0 fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71
[ 193.838217] 7c7c1b78 48000008 e8410018 ebfc0138 <813f0388> 2f890000 409e020c e93f0008
[ 193.838341] ---[ end trace 7cd21329722bcbd1 ]---
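The oops can be read directly from the instruction dump: the word in angle brackets, `<813f0388>`, is the faulting instruction, and `ebfc0138` just before it loads the register that instruction uses. Below is a minimal Python sketch that decodes these two words. It assumes the standard Power ISA D-form and DS-form field layout and handles only `lwz` and `ld`; the function name is illustrative.

```python
def decode_load(word):
    """Decode a Power ISA lwz (opcode 32) or ld (opcode 58, XO 0) word.

    Returns (mnemonic, target_reg, base_reg, displacement).
    """
    opcode = (word >> 26) & 0x3F
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    low16 = word & 0xFFFF
    if opcode == 32:
        # D-form: all 16 low bits are the displacement.
        mnemonic, disp = "lwz", low16
    elif opcode == 58 and (low16 & 0x3) == 0:
        # DS-form: the low 2 bits select the operation (0 = ld).
        mnemonic, disp = "ld", low16 & 0xFFFC
    else:
        raise ValueError(f"unsupported instruction 0x{word:08x}")
    if disp & 0x8000:
        # Sign-extend the 16-bit displacement.
        disp -= 0x10000
    return mnemonic, rt, ra, disp


for word in (0xEBFC0138, 0x813F0388):
    mnemonic, rt, ra, disp = decode_load(word)
    print(f"{word:08x}: {mnemonic} r{rt}, {disp:#x}(r{ra})")
```

This decodes to `ld r31, 0x138(r28)` followed by `lwz r9, 0x388(r31)`. GPR31 is 0 in the register dump and DAR is 0x388, so the pointer loaded from offset 0x138 was NULL when `mlx4_unload_one` dereferenced it. That would be consistent with the driver's per-device data being gone after the failed re-initialization, but the log alone does not confirm which structure field sits at that offset.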
There is a series of patches in this link that should resolve this issue.
http://permalink.gmane.org/gmane.linux.network/347777
I applied these to the upstream kernel and they work, but let me double-check against the Ubuntu 15.04 kernel whether these are the patches we need to resolve this bug.
I used the Ubuntu 15.04 kernel 3.18.0-13.14.
To make EEH work, just to get the first 2 patches of that series to apply, I first have to apply all of these patches:
From ca9f9f703950e5cb300526549b4f1b0a6605a5c5 Mon Sep 17 00:00:00 2001
From: Amir Vadai <amirv@xxxxxxxxxxxx>
Date: Tue, 25 Feb 2014 18:17:52 +0200
Subject: net/mlx4_en: Fix bad use of dev_id
From adbc7ac5c15eb5e9d70393428345e72a1a897d6a Mon Sep 17 00:00:00 2001
From: Saeed Mahameed <saeedm@xxxxxxxxxxxx>
Date: Mon, 27 Oct 2014 11:37:37 +0200
Subject: net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap
From a53e3e8c1db547981e13d1ebf24a659bd4e87710 Mon Sep 17 00:00:00 2001
From: Saeed Mahameed <saeedm@xxxxxxxxxxxx>
Date: Mon, 27 Oct 2014 11:37:38 +0200
Subject: net/mlx4_core: Add ethernet backplane autoneg device capability
From d475c95b4bcff983ac76e8522bfd2d29bcc567d0 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Sun, 2 Nov 2014 16:26:17 +0200
Subject: net/mlx4_core: Add retrieval of CONFIG_DEV parameters
From dd65beac48a5259945846956d4b27344dfb73bd9 Mon Sep 17 00:00:00 2001
From: Shani Michaeli <shanim@xxxxxxxxxxxx>
Date: Sun, 9 Nov 2014 13:51:52 +0200
Subject: net/mlx4_en: Extend usage of napi_gro_frags
From f8c6455bb04b944edb69e9b074e28efee2c56bdd Mon Sep 17 00:00:00 2001
From: Shani Michaeli <shanim@xxxxxxxxxxxx>
Date: Sun, 9 Nov 2014 13:51:53 +0200
Subject: net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
From ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:29 +0200
Subject: net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup
From a0eacca948d2d4531a393d82a736ff19b7b8fa0b Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:30 +0200
Subject: net/mlx4_core: Refactor mlx4_load_one
From e8c4265bea8437f5583d0c2f272058200ebc10ff Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:31 +0200
Subject: net/mlx4_core: Add QUERY_FUNC firmware command
From 7ae0e400cd9396c41fe596d35dcc34feaa89a04f Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:32 +0200
Subject: net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs
From da315679e80635021e98de1306ff4eee0759ba57 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Sun, 14 Dec 2014 16:18:04 +0200
Subject: net/mlx4_core: Fixed memory leak and incorrect refcount in
With those patches in place, I can apply the following from the series I pointed to:
==> 0001-net-mlx4_core-Maintain-a-persistent-memory-for-mlx4-.patch <==
From 872bf2fb69d90e3619befee842fc26db39d8e475 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:35 +0200
Subject: net/mlx4_core: Maintain a persistent memory for mlx4 device
==> 0002-net-mlx4_core-Set-device-configuration-data-to-be-pe.patch <==
From dd0eefe3abbf47442db296bf68f27eb2860c1cdf Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:36 +0200
Subject: net/mlx4_core: Set device configuration data to be persistent across reset
==> 0003-net-mlx4_core-Refactor-the-catas-flow-to-work-per-de.patch <==
From ad9a0bf08ffbf32b8f292c3bb78ca0f24bb8f6b2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:37 +0200
Subject: net/mlx4_core: Refactor the catas flow to work per device
==> 0004-net-mlx4_core-Enhance-the-catas-flow-to-support-devi.patch <==
From f6bc11e42646e661e699a5593cbd1e9dba7191d0 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:38 +0200
Subject: net/mlx4_core: Enhance the catas flow to support device reset
==> 0005-net-mlx4_core-Activate-reset-flow-upon-fatal-command.patch <==
From f5aef5aa35063f2b45c3605871cd525d0cb7fb7a Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:39 +0200
Subject: net/mlx4_core: Activate reset flow upon fatal command cases
==> 0006-net-mlx4_core-Manage-interface-state-for-Reset-flow-.patch <==
From c69453e294c9f16da977b68e658a8028b854c209 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:40 +0200
Subject: net/mlx4_core: Manage interface state for Reset flow cases
==> 0007-net-mlx4_core-Handle-AER-flow-properly.patch <==
From 2ba5fbd62b2534335f4e3b844ecc7860115525a3 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:41 +0200
Subject: net/mlx4_core: Handle AER flow properly
But to apply the whole series, including SR-IOV EEH support, I need these extra patches:
==> 0008-g-mlx4.patch <==
From 225c6c8c6bbbc32455df3d1c0fb1e1e1fb51c533 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:28 +0200
Subject: net/mlx4_core: Use correct variable type for mlx4_slave_cap
==> 0008-l-mlx4.patch <==
From de966c5928026b100a989c8cef761d306310a184 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:33 +0200
Subject: net/mlx4_core: Support more than 64 VFs
==> 0008-m-mlx4.patch <==
From 383677da43fa83b390888cf7d25885166b2a6812 Mon Sep 17 00:00:00 2001
From: Or Gerlitz <ogerlitz@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:57:52 +0200
Subject: net/mlx4_core: Mask out host side virtualization features for guests
==> 0008-net-mlx4_core-Enable-device-recovery-flow-with-SRIOV.patch <==
From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV
==> 0008-n-mlx4.patch <==
From ddae0349fdb78bcc5e7219061847012aa1a29069 Mon Sep 17 00:00:00 2001
From: Eugenia Emantayev <eugenia@xxxxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:57:54 +0200
Subject: net/mlx4: Change QP allocation scheme
==> 0008-o-mlx4.patch <==
From 431df8c7e9708433459fd806a08308997de43121 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:57:59 +0200
Subject: net/mlx4: Refactor QUERY_PORT
==> 0008-p-mlx4.patch <==
From ab256e5ad02b36951f01bf6b5cfda25f14820847 Mon Sep 17 00:00:00 2001
From: Dotan Barak <dotanb@xxxxxxxxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:57:55 +0200
Subject: net/mlx4: Add a check if there are too many reserved QPs
==> 0008-r-mlx4.patch <==
From d57febe1a47801ef8a55dbf10672850523dfaa60 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:57:57 +0200
Subject: net/mlx4: Add A0 hybrid steering
==> 0008-s-mlx4.patch <==
From 7d077cd34eabb2ffd05abe0f2cad01da1ef11712 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:58:00 +0200
Subject: net/mlx4: Add support for A0 steering
==> 0008-z-mlx4.patch <==
From 7a89399ffad7b7c47b43afda010309b3b88538c0 Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2014 10:57:56 +0200
Subject: net/mlx4: Add mlx4_bitmap zone allocator
Then I can apply these:
From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV
==> 0009-net-mlx4_core-Reset-flow-activation-upon-SRIOV-fatal.patch <==
From 0cd9302734111abc0b5912b695336f2ee63cb22b Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:43 +0200
Subject: net/mlx4_core: Reset flow activation upon SRIOV fatal command cases
So basically, applying the series will require a lot of patches, and the driver will probably need to be retested.
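Because the backport covers close to thirty commits, the order in which they are applied matters. Below is a minimal Python sketch that sorts patch headers like the ones listed above by their `Date:` field and prints the commit IDs oldest first. Author date is only an approximation of the upstream commit order (an assumption), and the function name and the three sample entries are illustrative.

```python
from email.utils import parsedate_to_datetime


def order_patches(header_text):
    """Return (commit_id, subject) pairs sorted by author date, oldest first."""
    patches = []
    current = None
    for line in header_text.splitlines():
        line = line.lstrip(">")  # tolerate mbox-escaped ">From" lines
        if line.startswith("From "):
            # "From <commit id> Mon Sep 17 00:00:00 2001" starts a new patch.
            current = {"id": line.split()[1]}
            patches.append(current)
        elif current is not None and line.startswith("Date: "):
            current["date"] = parsedate_to_datetime(line[len("Date: "):])
        elif current is not None and line.startswith("Subject: "):
            current["subject"] = line[len("Subject: "):]
    patches.sort(key=lambda patch: patch["date"])
    return [(patch["id"], patch["subject"]) for patch in patches]


# Three headers copied from this report, deliberately out of order.
SAMPLE = """\
From 872bf2fb69d90e3619befee842fc26db39d8e475 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
Date: Sun, 25 Jan 2015 16:59:35 +0200
Subject: net/mlx4_core: Maintain a persistent memory for mlx4 device
From a0eacca948d2d4531a393d82a736ff19b7b8fa0b Mon Sep 17 00:00:00 2001
From: Matan Barak <matanb@xxxxxxxxxxxx>
Date: Thu, 13 Nov 2014 14:45:30 +0200
Subject: net/mlx4_core: Refactor mlx4_load_one
From ca9f9f703950e5cb300526549b4f1b0a6605a5c5 Mon Sep 17 00:00:00 2001
From: Amir Vadai <amirv@xxxxxxxxxxxx>
Date: Tue, 25 Feb 2014 18:17:52 +0200
Subject: net/mlx4_en: Fix bad use of dev_id
"""

for commit_id, subject in order_patches(SAMPLE):
    print(f"git cherry-pick -x {commit_id}  # {subject}")
```

The printed commands are only a starting point: any commit that conflicts on the 3.18 tree still has to be resolved by hand, and the resulting driver retested.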
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Tags: architecture-ppc64le bot-comment bugnameltc-121681 severity-high targetmilestone-inin1504
--
mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)
https://bugs.launchpad.net/bugs/1422481
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.