kernel-packages team mailing list archive

[Bug 1422481] Comment bridged from LTC Bugzilla

 

------- Comment From clsoto@xxxxxxxxxx 2015-03-11 19:15 EDT-------
This looks fixed with 3.19.0-8-generic #8-Ubuntu;
the device was able to recover from EEH.

[ 2694.622586] EEH: Notify device drivers to shutdown
[ 2694.622587] mlx4_core 0004:01:00.0: device was reset successfully
[ 2694.622589] mlx4_core 0004:01:00.0: mlx4_pci_err_detected was called
[ 2694.622594] mlx4_en 0004:01:00.0: Internal error detected, restarting device
[ 2694.622786] mlx4_en: eth14: Close port called
[ 2694.846830] mlx4_en 0004:01:00.0: removed PHC
[ 2694.874036] EEH: Collect temporary log
[ 2694.879101] EEH: of node=/pciex@3fffe42000000/pci@0/ethernet@0
[ 2694.879465] EEH: PCI device/vendor: 100715b3
[ 2694.879478] EEH: PCI cmd/status register: 00100142
[ 2694.879479] EEH: PCI-E capabilities and status follow:
[ 2694.879544] EEH: PCI-E 00: 00020010 10008e02 0020204e 0843f483
[ 2694.879597] EEH: PCI-E 10: 10830040 00000000 00000000 00000000
[ 2694.879598] EEH: PCI-E 20: 00000000
[ 2694.879599] EEH: PCI-E AER capability register set follows:
[ 2694.879666] EEH: PCI-E AER 00: 18c20001 00000000 00000000 00062010
[ 2694.879719] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 2694.879772] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 2694.879785] EEH: PCI-E AER 30: 00000000 00000000
[ 2694.879787] PHB3 PHB#4 Diag-data (Version: 1)
[ 2694.879789] brdgCtl:     00000002
[ 2694.879790] UtlSts:      00200000 00000000 00000000
[ 2694.879791] RootSts:     00000040 00400000 f0830048 00100147 00000000
[ 2694.879792] PhbSts:      0000001c00000000 0000001c00000000
[ 2694.879793] Lem:         0000000000100000 42498e327f502eae 0000000000000000
[ 2694.879795] InAErr:      8000000000000000 8000000000000000 0402008000000000 0000000000000000
[ 2694.879796] PE[  1] A/B: 8480002b00000000 8000000000000000
[ 2694.879797] PE[  2] A/B: 8000000000000000 8000000000000000
[ 2694.879798] PE[  3] A/B: 8000000000000000 8000000000000000
[ 2694.879799] PE[  4] A/B: 8000000000000000 8000000000000000
[ 2694.879800] PE[  5] A/B: 8000000000000000 8000000000000000
[ 2694.879801] EEH: Reset without hotplug activity
[ 2698.898176] EEH: Notify device drivers the completion of reset
[ 2698.898181] mlx4_core 0004:01:00.0: mlx4_pci_slot_reset was called
[ 2698.898218] mlx4_core 0004:01:00.0: enabling device (0140 -> 0142)
[ 2705.396286] mlx4_core 0004:01:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 2705.396288] mlx4_core 0004:01:00.0: PCIe link width is x8, device supports x8
[ 2706.143789] mlx4_en 0004:01:00.0: registered PHC clock
[ 2706.143864] mlx4_en 0004:01:00.0: Activating port:1
[ 2706.159496] mlx4_en: eth11: Using 256 TX rings
[ 2706.159504] mlx4_en: eth11: Using 8 RX rings
[ 2706.159506] mlx4_en: eth11:   frag:0 - size:1518 prefix:0 stride:1536
[ 2706.159722] mlx4_en: eth11: Initializing port
[ 2706.160022] mlx4_en 0004:01:00.0: Activating port:2
[ 2706.165214] mlx4_core 0004:01:00.0 eth14: renamed from eth11
[ 2706.188419] mlx4_en: eth11: Using 256 TX rings
[ 2706.188427] mlx4_en: eth11: Using 8 RX rings
[ 2706.188430] mlx4_en: eth11:   frag:0 - size:1518 prefix:0 stride:1536
[ 2706.188660] mlx4_en: eth11: Initializing port
[ 2706.197316] EEH: Notify device driver to resume
[ 2706.525987] mlx4_core 0004:01:00.0 eth16: renamed from eth11
[ 2707.487156] mlx4_en: eth14: Link Up
[ 2707.542052] mlx4_en: eth16: Link Up
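
A log like the one above can be checked mechanically. The following is a minimal sketch, assuming the dmesg format shown here and that the "Notify device drivers to shutdown" and "Notify device driver to resume" messages bracket a single EEH event; the function name eeh_recovery_summary is made up for illustration.

```python
import re

# Matches the "[ seconds.micros] message" prefix used in the kernel log.
LINE_RE = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s+(.*)$")


def eeh_recovery_summary(log_text):
    """Return (recovered, seconds) for a single EEH event in a dmesg excerpt.

    recovered is True only if a resume notification was seen and no
    "EEH: Not recovered" message appeared. seconds is the time between the
    shutdown and resume notifications, or None if it cannot be computed.
    """
    start = end = None
    failed = False
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        timestamp, message = float(match.group(1)), match.group(2)
        if message.startswith("EEH: Notify device drivers to shutdown"):
            start = timestamp
        elif message.startswith("EEH: Notify device driver to resume"):
            end = timestamp
        elif message.startswith("EEH: Not recovered"):
            failed = True
    recovered = end is not None and not failed
    seconds = end - start if recovered and start is not None else None
    return recovered, seconds


# A few lines copied from the log in this comment.
sample = """\
[ 2694.622586] EEH: Notify device drivers to shutdown
[ 2698.898176] EEH: Notify device drivers the completion of reset
[ 2706.197316] EEH: Notify device driver to resume
[ 2707.487156] mlx4_en: eth14: Link Up
"""
recovered, seconds = eeh_recovery_summary(sample)
print(recovered, seconds)
```

For the excerpt above this reports a successful recovery taking roughly 11.6 seconds from shutdown notification to resume.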

thanks.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1422481

Title:
  mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)

Status in linux package in Ubuntu:
  Fix Released

Bug description:
  ---Problem Description---
  EEH recovery does not work with the mlx4 driver. When the driver tries to recover, it hits another EEH.
    
  ---uname output---
  Linux ubuntu 3.18.0-12-generic #13 SMP Mon Feb 9 16:31:42 CST 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  Requires a Mellanox adapter such as a ConnectX-3.

  Machine Type = P8 
    
  ---Steps to Reproduce---
   Inject an EEH error into the mlx4 device.
   
  Stack trace output:
   During EEH recovery it hits this:
  [  188.747571] EEH: Collect temporary log
  [  188.748330] EEH: of node=/pci@800000020000007/ethernet@3
  [  188.748339] EEH: PCI device/vendor: 100715b3
  [  188.748361] EEH: PCI cmd/status register: 00100146
  [  188.748362] EEH: PCI-E capabilities and status follow:
  [  188.748459] EEH: PCI-E 00: 00020010 10008e02 0001200e 0843f483
  [  188.748537] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
  [  188.748539] EEH: PCI-E 20: 00000000
  [  188.748540] EEH: PCI-E AER capability register set follows:
  [  188.748625] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
  [  188.748704] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
  [  188.748783] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
  [  188.748805] EEH: PCI-E AER 30: 00000000 00000000
  [  188.748813] EEH: Reset without hotplug activity
  [  193.833245] EEH: Notify device drivers the completion of reset
  [  193.833257] mlx4_core: Initializing 0001:00:03.0
  [  193.833317] mlx4_core 0001:00:03.0: BAR 0: can't reserve [mem 0x170b0000000-0x170b00fffff]
  [  193.833321] mlx4_core 0001:00:03.0: Couldn't get PCI resources, aborting
  [  193.833395] EEH: Not recovered
  [  193.833397] EEH: Unable to recover from failure from PHB#1-PE#1.
  Please try reseating or replacing it
  [  193.834531] EEH: of node=/pci@800000020000007/ethernet@3
  [  193.834547] EEH: PCI device/vendor: 100715b3
  [  193.834580] EEH: PCI cmd/status register: 00100142
  [  193.834582] EEH: PCI-E capabilities and status follow:
  [  193.834728] EEH: PCI-E 00: 00020010 10008e02 0000200e 0843f483
  [  193.834846] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
  [  193.834849] EEH: PCI-E 20: 00000000
  [  193.834850] EEH: PCI-E AER capability register set follows:
  [  193.834981] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
  [  193.835101] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
  [  193.835219] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
  [  193.835252] EEH: PCI-E AER 30: 00000000 00000000
  [  193.835289] Unable to handle kernel paging request for data at address 0x00000388
  [  193.835356] Faulting instruction address: 0xd000000001f3231c
  [  193.835415] Oops: Kernel access of bad area, sig: 11 [#1]
  [  193.835460] SMP NR_CPUS=2048 NUMA pSeries
  [  193.835509] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc rtc_generic mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core
  [  193.835886] CPU: 6 PID: 50 Comm: eehd Not tainted 3.18.0-12-generic #13
  [  193.835942] task: c0000003f72ca880 ti: c0000003f707c000 task.ti: c0000003f707c000
  [  193.836009] NIP: d000000001f3231c LR: d000000001f32790 CTR: d000000001f32760
  [  193.836076] REGS: c0000003f707f790 TRAP: 0300   Not tainted  (3.18.0-12-generic)
  [  193.836141] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44000048  XER: 20000000
  [  193.836302] CFAR: c0000000000a7be0 DAR: 0000000000000388 DSISR: 40000000 SOFTE: 1
  GPR00: d000000001f32790 c0000003f707fa10 d000000001f66310 c0000003fe0ad000
  GPR04: 0000000000000003 0000000000000000 0000000000000000 c0000003fd000000
  GPR08: 0000000000000001 d000000001f32760 00000000fffffffa 0000000100001001
  GPR12: d000000001f32760 c00000000fb83600 c0000000000d9118 c0000003f90e56c0
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c4ab90
  GPR24: c000000000c4ab68 0000000000100100 c0000003fe068580 c0000003fe068580
  GPR28: c0000003fe0ad000 c0000003fe0685e0 d000000001f5da50 0000000000000000
  [  193.837205] NIP [d000000001f3231c] mlx4_unload_one+0x3c/0x480 [mlx4_core]
  [  193.837269] LR [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
  [  193.837336] Call Trace:
  [  193.837361] [c0000003f707fa10] [c0000003fe068580] 0xc0000003fe068580 (unreliable)
  [  193.837447] [c0000003f707faa0] [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
  [  193.837528] [c0000003f707fae0] [c00000000003ac64] eeh_report_failure+0xb4/0xf0
  [  193.837606] [c0000003f707fb10] [c0000000000393b4] eeh_pe_dev_traverse+0x94/0x160
  [  193.837685] [c0000003f707fba0] [c00000000003b148] eeh_handle_normal_event+0xa8/0x400
  [  193.837764] [c0000003f707fc20] [c00000000003b6b4] eeh_handle_event+0x54/0x360
  [  193.837832] [c0000003f707fcd0] [c00000000003bae4] eeh_event_handler+0x124/0x1d0
  [  193.837911] [c0000003f707fd80] [c0000000000d9220] kthread+0x110/0x130
  [  193.837980] [c0000003f707fe30] [c000000000009568] ret_from_kernel_thread+0x5c/0x74
  [  193.838057] Instruction dump:
  [  193.838094] fb41ffd0 fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71
  [  193.838217] 7c7c1b78 48000008 e8410018 ebfc0138 <813f0388> 2f890000 409e020c e93f0008
  [  193.838341] ---[ end trace 7cd21329722bcbd1 ]---
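
The faulting word marked <813f0388> in the instruction dump above, and the word just before it, can be decoded by hand. The sketch below does this in Python, assuming the standard PowerPC D-form/DS-form load encoding. The helper names decode_load and describe are made up for illustration, and only the two opcodes that appear here are handled.

```python
# Primary opcodes for the two loads seen in the dump (32 = lwz, 58 = ld).
MNEMONICS = {32: "lwz", 58: "ld"}


def decode_load(word):
    """Split a 32-bit PowerPC load into (opcode, rt, ra, displacement)."""
    opcode = (word >> 26) & 0x3F
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFF
    if opcode == 58 and disp & 0x3:
        # DS-form: the low two bits select ldu/lwa, which this sketch skips.
        raise ValueError("only the plain ld variant is handled")
    if disp & 0x8000:
        disp -= 0x10000  # sign-extend the 16-bit displacement
    return opcode, rt, ra, disp


def describe(word):
    """Render a decoded load as assembly-like text."""
    opcode, rt, ra, disp = decode_load(word)
    return f"{MNEMONICS[opcode]} r{rt}, {disp:#x}(r{ra})"


# Register values copied from the GPR dump in the oops.
GPRS = {28: 0xC0000003FE0AD000, 31: 0x0}

previous_insn = describe(0xEBFC0138)
faulting_insn = describe(0x813F0388)
_, _, base_reg, offset = decode_load(0x813F0388)
fault_address = GPRS[base_reg] + offset

print(previous_insn)
print(faulting_insn)
print(hex(fault_address))
```

With r31 equal to 0 in the register dump, the faulting load targets address 0x388, which matches the DAR. That is consistent with a NULL pointer having been loaded from offset 0x138 of the structure in r28 and then dereferenced. Which field that is cannot be told from the dump alone.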

   
  A series of patches at the link below should resolve this issue:
  http://permalink.gmane.org/gmane.linux.network/347777
  I applied these to an upstream kernel and it works, but let me double-check against the Ubuntu 15.04 kernel whether these are the patches we need to resolve this bug.

  I used kernel 3.18.0-13.14 from Ubuntu 15.04.
  To make EEH work, just to get as far as the first 2 patches of that series, I have to apply all of these patches first:

  From ca9f9f703950e5cb300526549b4f1b0a6605a5c5 Mon Sep 17 00:00:00 2001
  From: Amir Vadai <amirv@xxxxxxxxxxxx>
  Date: Tue, 25 Feb 2014 18:17:52 +0200
  Subject: net/mlx4_en: Fix bad use of dev_id

  From adbc7ac5c15eb5e9d70393428345e72a1a897d6a Mon Sep 17 00:00:00 2001
  From: Saeed Mahameed <saeedm@xxxxxxxxxxxx>
  Date: Mon, 27 Oct 2014 11:37:37 +0200
  Subject: net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap

  
  From a53e3e8c1db547981e13d1ebf24a659bd4e87710 Mon Sep 17 00:00:00 2001
  From: Saeed Mahameed <saeedm@xxxxxxxxxxxx>
  Date: Mon, 27 Oct 2014 11:37:38 +0200
  Subject: net/mlx4_core: Add ethernet backplane autoneg device capability

  From d475c95b4bcff983ac76e8522bfd2d29bcc567d0 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Sun, 2 Nov 2014 16:26:17 +0200
  Subject: net/mlx4_core: Add retrieval of CONFIG_DEV parameters

  From dd65beac48a5259945846956d4b27344dfb73bd9 Mon Sep 17 00:00:00 2001
  From: Shani Michaeli <shanim@xxxxxxxxxxxx>
  Date: Sun, 9 Nov 2014 13:51:52 +0200
  Subject: net/mlx4_en: Extend usage of napi_gro_frags

  From f8c6455bb04b944edb69e9b074e28efee2c56bdd Mon Sep 17 00:00:00 2001
  From: Shani Michaeli <shanim@xxxxxxxxxxxx>
  Date: Sun, 9 Nov 2014 13:51:53 +0200
  Subject: net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

  From ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:29 +0200
  Subject: net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup

  From a0eacca948d2d4531a393d82a736ff19b7b8fa0b Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:30 +0200
  Subject: net/mlx4_core: Refactor mlx4_load_one

  From e8c4265bea8437f5583d0c2f272058200ebc10ff Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:31 +0200
  Subject: net/mlx4_core: Add QUERY_FUNC firmware command

  From 7ae0e400cd9396c41fe596d35dcc34feaa89a04f Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:32 +0200
  Subject: net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X
   vectors for PF/VFs

  From da315679e80635021e98de1306ff4eee0759ba57 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Sun, 14 Dec 2014 16:18:04 +0200
  Subject: net/mlx4_core: Fixed memory leak and incorrect refcount in

  With those patches in place, I can apply the following from the series that I pointed to:

  ==> 0001-net-mlx4_core-Maintain-a-persistent-memory-for-mlx4-.patch <==
  From 872bf2fb69d90e3619befee842fc26db39d8e475 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:35 +0200
  Subject: net/mlx4_core: Maintain a persistent memory for mlx4 device

  ==> 0002-net-mlx4_core-Set-device-configuration-data-to-be-pe.patch <==
  From dd0eefe3abbf47442db296bf68f27eb2860c1cdf Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:36 +0200
  Subject: net/mlx4_core: Set device configuration data to be persistent across
   reset

  ==> 0003-net-mlx4_core-Refactor-the-catas-flow-to-work-per-de.patch <==
  From ad9a0bf08ffbf32b8f292c3bb78ca0f24bb8f6b2 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:37 +0200
  Subject: net/mlx4_core: Refactor the catas flow to work per device

  ==> 0004-net-mlx4_core-Enhance-the-catas-flow-to-support-devi.patch <==
  From f6bc11e42646e661e699a5593cbd1e9dba7191d0 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:38 +0200
  Subject: net/mlx4_core: Enhance the catas flow to support device reset

  ==> 0005-net-mlx4_core-Activate-reset-flow-upon-fatal-command.patch <==
  From f5aef5aa35063f2b45c3605871cd525d0cb7fb7a Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:39 +0200
  Subject: net/mlx4_core: Activate reset flow upon fatal command cases

  ==> 0006-net-mlx4_core-Manage-interface-state-for-Reset-flow-.patch <==
  From c69453e294c9f16da977b68e658a8028b854c209 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:40 +0200
  Subject: net/mlx4_core: Manage interface state for Reset flow cases

  ==> 0007-net-mlx4_core-Handle-AER-flow-properly.patch <==
  From 2ba5fbd62b2534335f4e3b844ecc7860115525a3 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:41 +0200
  Subject: net/mlx4_core: Handle AER flow properly

  
  But to apply the whole series, including SRIOV EEH, I need these extra patches:
  ==> 0008-g-mlx4.patch <==
  From 225c6c8c6bbbc32455df3d1c0fb1e1e1fb51c533 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:28 +0200
  Subject: net/mlx4_core: Use correct variable type for mlx4_slave_cap

  ==> 0008-l-mlx4.patch <==
  From de966c5928026b100a989c8cef761d306310a184 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:33 +0200
  Subject: net/mlx4_core: Support more than 64 VFs

  ==> 0008-m-mlx4.patch <==
  From 383677da43fa83b390888cf7d25885166b2a6812 Mon Sep 17 00:00:00 2001
  From: Or Gerlitz <ogerlitz@xxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:57:52 +0200
  Subject: net/mlx4_core: Mask out host side virtualization features for guests

  ==> 0008-net-mlx4_core-Enable-device-recovery-flow-with-SRIOV.patch <==
  From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:42 +0200
  Subject: net/mlx4_core: Enable device recovery flow with SRIOV

  ==> 0008-n-mlx4.patch <==
  From ddae0349fdb78bcc5e7219061847012aa1a29069 Mon Sep 17 00:00:00 2001
  From: Eugenia Emantayev <eugenia@xxxxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:57:54 +0200
  Subject: net/mlx4: Change QP allocation scheme

  ==> 0008-o-mlx4.patch <==
  From 431df8c7e9708433459fd806a08308997de43121 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:57:59 +0200
  Subject: net/mlx4: Refactor QUERY_PORT

  ==> 0008-p-mlx4.patch <==
  From ab256e5ad02b36951f01bf6b5cfda25f14820847 Mon Sep 17 00:00:00 2001
  From: Dotan Barak <dotanb@xxxxxxxxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:57:55 +0200
  Subject: net/mlx4: Add a check if there are too many reserved QPs

  ==> 0008-r-mlx4.patch <==
  From d57febe1a47801ef8a55dbf10672850523dfaa60 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:57:57 +0200
  Subject: net/mlx4: Add A0 hybrid steering

  ==> 0008-s-mlx4.patch <==
  From 7d077cd34eabb2ffd05abe0f2cad01da1ef11712 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:58:00 +0200
  Subject: net/mlx4: Add support for A0 steering

  ==> 0008-z-mlx4.patch <==
  From 7a89399ffad7b7c47b43afda010309b3b88538c0 Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 11 Dec 2014 10:57:56 +0200
  Subject: net/mlx4: Add mlx4_bitmap zone allocator

  Then I can apply these:
  From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:42 +0200
  Subject: net/mlx4_core: Enable device recovery flow with SRIOV

  ==> 0009-net-mlx4_core-Reset-flow-activation-upon-SRIOV-fatal.patch <==
  From 0cd9302734111abc0b5912b695336f2ee63cb22b Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:43 +0200
  Subject: net/mlx4_core: Reset flow activation upon SRIOV fatal command cases
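
To turn the listing above into a plain list of commits, the git format-patch headers can be parsed. This is a minimal sketch, assuming the headers appear exactly as pasted in this description (a "From <sha> Mon Sep 17 00:00:00 2001" line followed by a Subject line that may wrap onto the next line). The function name parse_patch_headers is made up for illustration.

```python
import re

# The fixed "From <sha> <magic date>" line that git format-patch emits.
FROM_RE = re.compile(r"^From ([0-9a-f]{40}) Mon Sep 17 00:00:00 2001$")


def parse_patch_headers(text):
    """Return [(sha, subject)] in listing order, skipping repeated shas."""
    patches = []
    seen = set()
    sha = None
    subject_parts = None
    in_subject = False

    def flush():
        # Record the entry collected so far, if it is complete and new.
        nonlocal sha, subject_parts
        if sha is not None and subject_parts is not None and sha not in seen:
            seen.add(sha)
            patches.append((sha, " ".join(subject_parts)))
        sha, subject_parts = None, None

    for raw in text.splitlines():
        line = raw.strip()
        match = FROM_RE.match(line)
        if match:
            flush()
            sha = match.group(1)
            in_subject = False
        elif sha is not None and line.startswith("Subject:"):
            subject_parts = [line[len("Subject:"):].strip()]
            in_subject = True
        elif in_subject and line and not line.startswith("==>"):
            # A wrapped subject continues on the following non-blank line.
            subject_parts.append(line)
        else:
            in_subject = False
    flush()
    return patches


# Entries copied from this description; the last one is a repeat.
sample = """\
  From 7ae0e400cd9396c41fe596d35dcc34feaa89a04f Mon Sep 17 00:00:00 2001
  From: Matan Barak <matanb@xxxxxxxxxxxx>
  Date: Thu, 13 Nov 2014 14:45:32 +0200
  Subject: net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X
   vectors for PF/VFs
  ==> 0008-net-mlx4_core-Enable-device-recovery-flow-with-SRIOV.patch <==
  From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
  From: Yishai Hadas <yishaih@xxxxxxxxxxxx>
  Date: Sun, 25 Jan 2015 16:59:42 +0200
  Subject: net/mlx4_core: Enable device recovery flow with SRIOV

  From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
  Subject: net/mlx4_core: Enable device recovery flow with SRIOV
"""
patches = parse_patch_headers(sample)
for sha, subject in patches:
    print(sha[:12], subject)
```

The result keeps the order in which the patches are listed here, which is not necessarily the order in which they must be applied.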

  So basically, applying the series will require a lot of patches and
  probably retesting the driver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1422481/+subscriptions