kernel-packages team mailing list archive

[Bug 1522071] [NEW] EEH recovery fails for shinner T on firestone

 

You have been subscribed to a public bug:

== Comment: #0 - Manvanthara B. Puttashankar <mputtash@xxxxxxxxxx> - 2015-07-27 02:38:12 ==
---Problem Description---
EEH recovery fails for shinner T on firestone 
 
Contact Information = mputtash@xxxxxxxxxx 
 
---uname output---
Linux rcx2c309 3.19.0-23-generic #24~14.04.1-Ubuntu SMP Wed Jul 8 11:17:19 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = firestone 
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
 root@rcx2c309:~# uname -a
Linux rcx2c309 3.19.0-23-generic #24~14.04.1-Ubuntu SMP Wed Jul 8 11:17:19 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

root@rcx2c309:~# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   100baseT/Half 100baseT/Full
                                1000baseT/Full
                                10000baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Advertised link modes:  100baseT/Half 100baseT/Full
                                1000baseT/Full
                                10000baseT/Full
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
                                             1000baseT/Full
        Link partner advertised pause frame use: Transmit-only
        Link partner advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 17
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: Unknown
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000000 (0)

        Link detected: yes


root@rcx2c309:/sys/bus/pci/devices/0001:01:00.1# ll /sys/class/net/
total 0
drwxr-xr-x  2 root root 0 Jul 24 04:23 ./
drwxr-xr-x 58 root root 0 Jul 24 03:45 ../
lrwxrwxrwx  1 root root 0 Jul 26 23:17 eth0 -> ../../devices/pci0001:00/0001:00:00.0/0001:01:00.0/net/eth0/
lrwxrwxrwx  1 root root 0 Jul 24 07:33 eth1 -> ../../devices/pci0001:00/0001:00:00.0/0001:01:00.1/net/eth1/      <====================  this interface
lrwxrwxrwx  1 root root 0 Jul 24 03:45 lo -> ../../devices/virtual/net/lo/
lrwxrwxrwx  1 root root 0 Jul 24 03:45 virbr0 -> ../../devices/virtual/net/virbr0/
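The listing above is what ties eth1 to PCI function 0001:01:00.1. As a minimal sketch (not part of the original report), the same mapping can be read out of a symlink target string in Python. The helper name `pci_address_from_link` is hypothetical, and the regular expression assumes the usual domain:bus:device.function address layout.

```python
import re

# PCI addresses look like domain:bus:device.function, e.g. 0001:01:00.1.
# This pattern is an assumption of the sketch, not something taken from the bug.
PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def pci_address_from_link(target):
    """Return the last PCI address found in a /sys/class/net symlink target.

    The last matching path component is the device itself; earlier matches
    are upstream bridges. Returns None for virtual interfaces such as lo.
    """
    matches = [part for part in target.split("/") if PCI_ADDR.match(part)]
    return matches[-1] if matches else None


# Symlink targets copied from the listing above.
print(pci_address_from_link(
    "../../devices/pci0001:00/0001:00:00.0/0001:01:00.1/net/eth1/"))
print(pci_address_from_link("../../devices/virtual/net/lo/"))
```

On a live system the target string could come from `os.readlink("/sys/class/net/eth1")`. The sketch takes the string as an argument so that it can be run anywhere.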


Every 2.0s: netstat -i                                  Sun Jul 26 23:26:16 2015

Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1     1500   0 1230820      0  12167      0   45239      0      0      0 BMRU
lo      65536   0      22      0      0      0      22      0      0      0 LRU
virbr0   1500   0       0      0      0      0       0      0      0      0 BMU




syslog:

Jul 27 01:09:54 rcx2c309 kernel: [   68.122649] EEH: Frozen PE#1 on PHB#1 detected
Jul 27 01:09:54 rcx2c309 kernel: [   68.122790] EEH: PE location: N/A, PHB location: N/A
Jul 27 01:09:54 rcx2c309 kernel: [   68.123539] EEH: This PCI device has failed 1 times in the last hour
Jul 27 01:09:54 rcx2c309 kernel: [   68.123540] EEH: Notify device drivers to shutdown
Jul 27 01:09:54 rcx2c309 kernel: [   68.123545] bnx2x: [bnx2x_io_error_detected:13702(eth0)]IO error detected
Jul 27 01:09:54 rcx2c309 kernel: [   68.123706] bnx2x: [bnx2x_io_error_detected:13702(eth1)]IO error detected
Jul 27 01:09:54 rcx2c309 kernel: [   68.154922] bnx2x: [bnx2x_timer:5753(eth1)]MFW seems hanged: drv_pulse (0x75) != mcp_pulse (0x7fff)
Jul 27 01:09:54 rcx2c309 kernel: [   68.155146] EEH: Collect temporary log
Jul 27 01:09:54 rcx2c309 kernel: [   68.235532] PHB3 PHB#1 Diag-data (Version: 1)
Jul 27 01:09:54 rcx2c309 kernel: [   68.235535] brdgCtl:     00000002
Jul 27 01:09:54 rcx2c309 kernel: [   68.235538] RootSts:     00000040 00400000 f0820048 00100147 00002000
Jul 27 01:09:54 rcx2c309 kernel: [   68.235541] PhbSts:      0000001c00000000 0000001c00000000
Jul 27 01:09:54 rcx2c309 kernel: [   68.235543] Lem:         0000001000000004 42498e327f502eae 0000000000000000
Jul 27 01:09:54 rcx2c309 kernel: [   68.235546] OutErr:      0000000800000000 0000000800000000 0204006000003b10 113c7cd800000000
Jul 27 01:09:54 rcx2c309 kernel: [   68.235549] InBErr:      0000000000000020 0000000000000020 4001010000000000 0000000000000000
Jul 27 01:09:54 rcx2c309 kernel: [   68.235551] PE[  1] A/B: 8400001b00000000 80003b10113c7cd8
Jul 27 01:09:54 rcx2c309 kernel: [   68.235554] EEH: Reset without hotplug activity
Jul 27 01:09:54 rcx2c309 kernel: [   68.236546] EEH: PHB#1 failure detected, location: N/A
Jul 27 01:09:54 rcx2c309 kernel: [   68.236698] CPU: 9 PID: 1093 Comm: kworker/9:1 Tainted: G           OE  3.19.0-23-generic #24~14.04.1-Ubuntu
Jul 27 01:09:54 rcx2c309 kernel: [   68.236704] Workqueue: events linkwatch_event
Jul 27 01:09:54 rcx2c309 kernel: [   68.236706] Call Trace:
Jul 27 01:09:54 rcx2c309 kernel: [   68.236709] [c000003c9923b6c0] [c000000000a26690] dump_stack+0x90/0xbc (unreliable)
Jul 27 01:09:54 rcx2c309 kernel: [   68.236713] [c000003c9923b6f0] [c000000000036a5c] eeh_dev_check_failure+0x22c/0x560
Jul 27 01:09:54 rcx2c309 kernel: [   68.236715] [c000003c9923b790] [c000000000036e14] eeh_check_failure+0x84/0xe0
Jul 27 01:09:54 rcx2c309 kernel: [   68.236737] [c000003c9923b7d0] [d00000001c7854a0] bnx2x_get_ext_phy_fw_version+0x1e0/0x220 [bnx2x]
Jul 27 01:09:54 rcx2c309 kernel: [   68.236746] [c000003c9923b830] [d00000001c794c34] bnx2x_fill_fw_str+0x64/0x140 [bnx2x]
Jul 27 01:09:54 rcx2c309 kernel: [   68.236754] [c000003c9923b8e0] [d00000001c79f2ac] bnx2x_get_drvinfo+0x6c/0x100 [bnx2x]
Jul 27 01:09:54 rcx2c309 kernel: [   68.236761] [c000003c9923b910] [d00000001e34f9b0] netdevice_event+0xc0/0x350 [ib_core]
Jul 27 01:09:54 rcx2c309 kernel: [   68.236765] [c000003c9923ba90] [c0000000000dbce8] notifier_call_chain+0x98/0x100
Jul 27 01:09:54 rcx2c309 kernel: [   68.236767] [c000003c9923bae0] [c0000000008b796c] call_netdevice_notifiers_info+0x5c/0xb0
Jul 27 01:09:54 rcx2c309 kernel: [   68.236770] [c000003c9923bb60] [c0000000008bde48] netdev_state_change+0x48/0x80
Jul 27 01:09:54 rcx2c309 kernel: [   68.236772] [c000003c9923bba0] [c0000000008db014] linkwatch_do_dev+0x74/0xd0
Jul 27 01:09:54 rcx2c309 kernel: [   68.236773] [c000003c9923bbd0] [c0000000008db54c] __linkwatch_run_queue+0x14c/0x270
Jul 27 01:09:54 rcx2c309 kernel: [   68.236775] [c000003c9923bc40] [c0000000008db6b4] linkwatch_event+0x44/0x60
Jul 27 01:09:54 rcx2c309 kernel: [   68.236778] [c000003c9923bc60] [c0000000000d291c] process_one_work+0x19c/0x480
Jul 27 01:09:54 rcx2c309 kernel: [   68.236780] [c000003c9923bcf0] [c0000000000d31c0] worker_thread+0x190/0x5b0
Jul 27 01:09:54 rcx2c309 kernel: [   68.236782] [c000003c9923bd80] [c0000000000da4f4] kthread+0x114/0x140
Jul 27 01:09:54 rcx2c309 kernel: [   68.236785] [c000003c9923be30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
Jul 27 01:09:56 rcx2c309 kernel: [   70.038711] pnv_ioda_unfreeze_pe: Failure -6 clear 1 on PHB#1-PE#1
Jul 27 01:09:56 rcx2c309 kernel: [   70.038713] eeh_pci_enable: Unexpected state change 2 on PHB#1-PE#1, err=-5
Jul 27 01:09:56 rcx2c309 kernel: [   70.038937] pnv_ioda_unfreeze_pe: Failure -6 clear 2 on PHB#1-PE#1
Jul 27 01:09:56 rcx2c309 kernel: [   70.038938] eeh_pci_enable: Unexpected state change 3 on PHB#1-PE#1, err=-5
Jul 27 01:09:56 rcx2c309 kernel: [   70.038940] EEH: Notify device drivers the completion of reset
Jul 27 01:09:56 rcx2c309 kernel: [   70.038943] bnx2x: [bnx2x_io_slot_reset:13737(eth0)]IO slot reset initializing...
Jul 27 01:09:56 rcx2c309 kernel: [   70.039706] EEH: Frozen PHB#1-PE#1 detected
Jul 27 01:09:56 rcx2c309 kernel: [   70.039733] EEH: PE location: N/A, PHB location: N/A
Jul 27 01:09:56 rcx2c309 kernel: [   70.039767] CPU: 9 PID: 812 Comm: eehd Tainted: G           OE  3.19.0-23-generic #24~14.04.1-Ubuntu
Jul 27 01:09:56 rcx2c309 kernel: [   70.039768] Call Trace:
Jul 27 01:09:56 rcx2c309 kernel: [   70.039770] [c000003ca1e6f840] [c000000000a26690] dump_stack+0x90/0xbc (unreliable)
Jul 27 01:09:56 rcx2c309 kernel: [   70.039772] [c000003ca1e6f870] [c000000000036d74] eeh_dev_check_failure+0x544/0x560
Jul 27 01:09:56 rcx2c309 kernel: [   70.039775] [c000003ca1e6f910] [c000000000076c9c] pnv_pci_read_config+0x13c/0x1a0
Jul 27 01:09:56 rcx2c309 kernel: [   70.039778] [c000003ca1e6f960] [c000000000561204] pci_bus_read_config_word+0xc4/0x110
Jul 27 01:09:56 rcx2c309 kernel: [   70.039781] [c000003ca1e6f9c0] [c00000000056f574] pci_enable_device_flags+0x174/0x1a0
Jul 27 01:09:56 rcx2c309 kernel: [   70.039790] [c000003ca1e6fa10] [d00000001c761dc4] bnx2x_io_slot_reset+0x94/0x570 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [   70.039792] [c000003ca1e6fad0] [c00000000003ab04] eeh_report_reset+0x104/0x140
Jul 27 01:09:56 rcx2c309 kernel: [   70.039793] [c000003ca1e6fb10] [c0000000000395c8] eeh_pe_dev_traverse+0x98/0x170
Jul 27 01:09:56 rcx2c309 kernel: [   70.039795] [c000003ca1e6fba0] [c00000000003b584] eeh_handle_normal_event+0x334/0x410
Jul 27 01:09:56 rcx2c309 kernel: [   70.039797] [c000003ca1e6fc20] [c00000000003b968] eeh_handle_event+0x188/0x340
Jul 27 01:09:56 rcx2c309 kernel: [   70.039799] [c000003ca1e6fcd0] [c00000000003bce8] eeh_event_handler+0x1c8/0x1d0
Jul 27 01:09:56 rcx2c309 kernel: [   70.039801] [c000003ca1e6fd80] [c0000000000da4f4] kthread+0x114/0x140
Jul 27 01:09:56 rcx2c309 kernel: [   70.039803] [c000003ca1e6fe30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
Jul 27 01:09:56 rcx2c309 kernel: [   70.054577] pci_raw_set_power_state: 33 callbacks suppressed
Jul 27 01:09:56 rcx2c309 kernel: [   70.054580] bnx2x 0001:01:00.0: Refused to change power state, currently in D3
Jul 27 01:09:56 rcx2c309 kernel: [   70.114605] bnx2x: [bnx2x_io_slot_reset:13797(eth0)]pci_cleanup_aer_uncorrect_error_status failed
Jul 27 01:09:56 rcx2c309 kernel: [   70.114817] bnx2x: [bnx2x_io_slot_reset:13737(eth1)]IO slot reset initializing...
Jul 27 01:09:56 rcx2c309 kernel: [   70.130577] bnx2x 0001:01:00.1: Refused to change power state, currently in D3
Jul 27 01:09:56 rcx2c309 kernel: [   70.214576] bnx2x: [bnx2x_io_slot_reset:13753(eth1)]IO slot reset --> driver unload
Jul 27 01:09:56 rcx2c309 kernel: [   70.214790] Unable to handle kernel paging request for data at address 0xd0000801827fffff
Jul 27 01:09:56 rcx2c309 kernel: [   70.214965] Faulting instruction address: 0xd00000001c742a70
Jul 27 01:09:56 rcx2c309 kernel: [   70.215007] Oops: Kernel access of bad area, sig: 11 [#1]
Jul 27 01:09:56 rcx2c309 kernel: [   70.215039] SMP NR_CPUS=2048 NUMA PowerNV
Jul 27 01:09:56 rcx2c309 kernel: [   70.215074] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables x_tables ast ttm joydev mac_hid hid_generic usbhid at24 ipmi_powernv powernv_rng ipmi_msghandler uio_pdrv_genirq drm_kms_helper uio hid drm syscopyarea sysfillrect sysimgblt i2c_algo_bit nfsd auth_rpcgss nfs_acl nfs lockd knem(OE) grace sunrpc fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel mlx4_core(OE) mlx_compat(OE) uas usb_storage bnx2x ahci libahci mdio libcrc32c
Jul 27 01:09:56 rcx2c309 kernel: [   70.215777] CPU: 9 PID: 812 Comm: eehd Tainted: G           OE  3.19.0-23-generic #24~14.04.1-Ubuntu
Jul 27 01:09:56 rcx2c309 kernel: [   70.215834] task: c000003ca0139100 ti: c000003ca1e6c000 task.ti: c000003ca1e6c000
Jul 27 01:09:56 rcx2c309 kernel: [   70.216017] NIP: d00000001c742a70 LR: d00000001c742a50 CTR: c000000000036d90
Jul 27 01:09:56 rcx2c309 kernel: [   70.216066] REGS: c000003ca1e6f710 TRAP: 0300   Tainted: G           OE   (3.19.0-23-generic)
Jul 27 01:09:56 rcx2c309 kernel: [   70.216122] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28008084  XER: 00000000
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] CFAR: c000000000036e24 DAR: d0000801827fffff DSISR: 40000000 SOFTE: 1
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR00: d00000001c742a50 c000003ca1e6f990 d00000001c809348 d0000801827fffff
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR04: 0000000000000001 c000003ca1e6f970 9000000100009033 0000000000000001
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR08: 0000000000000000 0000000000000000 0000000000000000 d00000001c7d2030
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR12: 0000000000008800 c00000000fb85100 c0000000000da3e8 c000001fe2931980
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c51108
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR24: c000000000c510e0 0000000000100100 c000001fe25d0000 c000001fe25d0000
Jul 27 01:09:56 rcx2c309 kernel: [   70.216246] GPR28: ffffffffffffffff 0000000000000033 00000000ffffffff c000001fe198c900
Jul 27 01:09:56 rcx2c309 kernel: [   70.217034] NIP [d00000001c742a70] bnx2x_init_shmem+0x180/0x1f0 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [   70.217081] LR [d00000001c742a50] bnx2x_init_shmem+0x160/0x1f0 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [   70.217122] Call Trace:
Jul 27 01:09:56 rcx2c309 kernel: [   70.217145] [c000003ca1e6f990] [d00000001c742a50] bnx2x_init_shmem+0x160/0x1f0 [bnx2x] (unreliable)
Jul 27 01:09:56 rcx2c309 kernel: [   70.217217] [c000003ca1e6fa10] [d00000001c761f48] bnx2x_io_slot_reset+0x218/0x570 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [   70.217274] [c000003ca1e6fad0] [c00000000003ab04] eeh_report_reset+0x104/0x140
Jul 27 01:09:56 rcx2c309 kernel: [   70.217331] [c000003ca1e6fb10] [c0000000000395c8] eeh_pe_dev_traverse+0x98/0x170
Jul 27 01:09:56 rcx2c309 kernel: [   70.217389] [c000003ca1e6fba0] [c00000000003b584] eeh_handle_normal_event+0x334/0x410
Jul 27 01:09:56 rcx2c309 kernel: [   70.217445] [c000003ca1e6fc20] [c00000000003b968] eeh_handle_event+0x188/0x340
Jul 27 01:09:56 rcx2c309 kernel: [   70.217502] [c000003ca1e6fcd0] [c00000000003bce8] eeh_event_handler+0x1c8/0x1d0
Jul 27 01:09:56 rcx2c309 kernel: [   70.217558] [c000003ca1e6fd80] [c0000000000da4f4] kthread+0x114/0x140
Jul 27 01:09:56 rcx2c309 kernel: [   70.217608] [c000003ca1e6fe30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
Jul 27 01:09:56 rcx2c309 kernel: [   70.217798] Instruction dump:
Jul 27 01:09:56 rcx2c309 kernel: [   70.217825] 40820014 792a07e1 4182000c 4808f5e5 e8410018 893f0033 e87f0020 939f0928
Jul 27 01:09:56 rcx2c309 kernel: [   70.217917] 79291768 7fde4a14 7c63f214 7c0004ac <81230000> 0c090000 4c00012c 2f89ffff
Jul 27 01:09:56 rcx2c309 kernel: [   70.218009] ---[ end trace 8d49f86574f73f94 ]---
Jul 27 01:09:56 rcx2c309 kernel: [   70.218041]
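The recovery sequence in the log above is easier to follow once the stack traces and register dumps are filtered out. The following is a minimal sketch, not part of the original report, that pulls the EEH milestones out of syslog lines. The function name `eeh_timeline` and the choice of marker strings are assumptions of this sketch. The markers are substrings seen in this log, not an exhaustive list of EEH messages.

```python
import re

# Substrings taken from the log above; which ones count as "milestones"
# is this sketch's own choice.
MARKERS = ("EEH:", "eeh_pci_enable:", "pnv_ioda_unfreeze_pe:", "Oops:")

# Matches the "kernel: [   68.122649] message" part of a syslog line.
STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)$")


def eeh_timeline(lines):
    """Return (seconds_since_boot, message) pairs for EEH-related lines."""
    events = []
    for line in lines:
        m = STAMP.search(line)
        if m and any(marker in m.group(2) for marker in MARKERS):
            events.append((float(m.group(1)), m.group(2)))
    return events


# Three lines copied from the log above; the bnx2x line is filtered out.
sample = [
    "Jul 27 01:09:54 rcx2c309 kernel: [   68.122649] EEH: Frozen PE#1 on PHB#1 detected",
    "Jul 27 01:09:54 rcx2c309 kernel: [   68.123545] bnx2x: [bnx2x_io_error_detected:13702(eth0)]IO error detected",
    "Jul 27 01:09:56 rcx2c309 kernel: [   70.038711] pnv_ioda_unfreeze_pe: Failure -6 clear 1 on PHB#1-PE#1",
]
for stamp, message in eeh_timeline(sample):
    print(f"{stamp:10.6f}  {message}")
```

Run over the full log, this would list the frozen-PE detection, the failed unfreeze attempts, the second freeze and the final oops in order, which is the sequence the later comments discuss.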



 
Userspace tool common name: EEH 
 
The userspace tool has the following bit modes: ppc64le 

Userspace rpm: EEH

Userspace tool obtained from project website:  na 
 
*Additional Instructions for mputtash@xxxxxxxxxx: 
-Post a private note with access information to the machine that the bug is occurring on.
-Attach ltrace and strace of userspace application.

== Comment: #8 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-06 21:01:21 ==
Manvanthara, please contact me over Sametime with the machine access information so that I can debug this and come up with a patch to fix it. Thanks!

== Comment: #10 - Mukesh K. Ojha <mukeojha@xxxxxxxxxx> - 2015-08-18 04:50:19 ==
Hi All,

Any update on this issue?

== Comment: #13 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 20:38:18 ==
Actually, Manvanthara is reporting two different issues, one in comment#0 and one in comment#7. I'm looking at the problem reported in comment#7, which can be reproduced with 4.2-rc8 (upstream kernel). If Manvanthara agrees, I think we should open another bug to track the issue from comment#7 and let this bug track the issue from comment#0, as they are different issues from my perspective. Thanks!

== Comment: #14 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 22:03:16 ==
One patch was sent to the community for review; it is tracked at the link below. I also installed a private kernel built from 4.2-rc8 plus the patch. With it, the EEH error is recovered successfully. The kernel can be selected from the Petitboot menu entry "Ubuntu, with Linux 4.2.0-rc8gavin+" in case anybody wants to try it. Thanks!

https://patchwork.ozlabs.org/patch/511744/   ("powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail()")

== Comment: #15 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 23:42:16 ==
Please ignore the statement in comment 13 that these are different issues. The correction is: they are the same issue, so we don't need to open another bug at all. Sorry for the confusion :-)

== Comment: #16 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 23:43:36 ==
Michael Ellerman told me the patch will be put into 4.3-rc3. Closing this as "fixed".

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: architecture-ppc64le bugnameltc-128071 severity-critical targetmilestone-inin---
-- 
EEH recovery fails for shinner T on firestone
https://bugs.launchpad.net/bugs/1522071
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.