kernel-packages team mailing list archive
Message #151626
[Bug 1522071] Re: EEH recovery fails for shinner T on firestone
** Changed in: linux (Ubuntu Vivid)
Status: In Progress => Fix Committed
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1522071
Title:
EEH recovery fails for shinner T on firestone
Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Vivid:
Fix Committed
Status in linux source package in Wily:
Fix Released
Status in linux source package in Xenial:
Fix Released
Bug description:
== Comment: #0 - Manvanthara B. Puttashankar <mputtash@xxxxxxxxxx> - 2015-07-27 02:38:12 ==
---Problem Description---
EEH recovery fails for shinner T on firestone
Contact Information = mputtash@xxxxxxxxxx
---uname output---
Linux rcx2c309 3.19.0-23-generic #24~14.04.1-Ubuntu SMP Wed Jul 8 11:17:19 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = firestone
---Debugger---
A debugger is not configured
---Steps to Reproduce---
root@rcx2c309:~# uname -a
Linux rcx2c309 3.19.0-23-generic #24~14.04.1-Ubuntu SMP Wed Jul 8 11:17:19 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
root@rcx2c309:~# ethtool eth1
Settings for eth1:
    Supported ports: [ TP ]
    Supported link modes:   100baseT/Half 100baseT/Full
                            1000baseT/Full
                            10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Advertised link modes:  100baseT/Half 100baseT/Full
                            1000baseT/Full
                            10000baseT/Full
    Advertised pause frame use: Symmetric Receive-only
    Advertised auto-negotiation: Yes
    Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                         100baseT/Half 100baseT/Full
                                         1000baseT/Full
    Link partner advertised pause frame use: Transmit-only
    Link partner advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 17
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: Unknown
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000000 (0)
    Link detected: yes
root@rcx2c309:/sys/bus/pci/devices/0001:01:00.1# ll /sys/class/net/
total 0
drwxr-xr-x 2 root root 0 Jul 24 04:23 ./
drwxr-xr-x 58 root root 0 Jul 24 03:45 ../
lrwxrwxrwx 1 root root 0 Jul 26 23:17 eth0 -> ../../devices/pci0001:00/0001:00:00.0/0001:01:00.0/net/eth0/
lrwxrwxrwx 1 root root 0 Jul 24 07:33 eth1 -> ../../devices/pci0001:00/0001:00:00.0/0001:01:00.1/net/eth1/ <==================== this interface
lrwxrwxrwx 1 root root 0 Jul 24 03:45 lo -> ../../devices/virtual/net/lo/
lrwxrwxrwx 1 root root 0 Jul 24 03:45 virbr0 -> ../../devices/virtual/net/virbr0/
Every 2.0s: netstat -i
Sun Jul 26 23:26:16 2015
Kernel Interface table
Iface    MTU   Met RX-OK   RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1     1500  0   1230820 0      12167  0      45239 0      0      0      BMRU
lo       65536 0   22      0      0      0      22    0      0      0      LRU
virbr0   1500  0   0       0      0      0      0     0      0      0      BMU
syslog:
Jul 27 01:09:54 rcx2c309 kernel: [ 68.122649] EEH: Frozen PE#1 on PHB#1 detected
Jul 27 01:09:54 rcx2c309 kernel: [ 68.122790] EEH: PE location: N/A, PHB location: N/A
Jul 27 01:09:54 rcx2c309 kernel: [ 68.123539] EEH: This PCI device has failed 1 times in the last hour
Jul 27 01:09:54 rcx2c309 kernel: [ 68.123540] EEH: Notify device drivers to shutdown
Jul 27 01:09:54 rcx2c309 kernel: [ 68.123545] bnx2x: [bnx2x_io_error_detected:13702(eth0)]IO error detected
Jul 27 01:09:54 rcx2c309 kernel: [ 68.123706] bnx2x: [bnx2x_io_error_detected:13702(eth1)]IO error detected
Jul 27 01:09:54 rcx2c309 kernel: [ 68.154922] bnx2x: [bnx2x_timer:5753(eth1)]MFW seems hanged: drv_pulse (0x75) != mcp_pulse (0x7fff)
Jul 27 01:09:54 rcx2c309 kernel: [ 68.155146] EEH: Collect temporary log
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235532] PHB3 PHB#1 Diag-data (Version: 1)
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235535] brdgCtl: 00000002
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235538] RootSts: 00000040 00400000 f0820048 00100147 00002000
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235541] PhbSts: 0000001c00000000 0000001c00000000
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235543] Lem: 0000001000000004 42498e327f502eae 0000000000000000
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235546] OutErr: 0000000800000000 0000000800000000 0204006000003b10 113c7cd800000000
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235549] InBErr: 0000000000000020 0000000000000020 4001010000000000 0000000000000000
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235551] PE[ 1] A/B: 8400001b00000000 80003b10113c7cd8
Jul 27 01:09:54 rcx2c309 kernel: [ 68.235554] EEH: Reset without hotplug activity
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236546] EEH: PHB#1 failure detected, location: N/A
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236698] CPU: 9 PID: 1093 Comm: kworker/9:1 Tainted: G OE 3.19.0-23-generic #24~14.04.1-Ubuntu
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236704] Workqueue: events linkwatch_event
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236706] Call Trace:
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236709] [c000003c9923b6c0] [c000000000a26690] dump_stack+0x90/0xbc (unreliable)
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236713] [c000003c9923b6f0] [c000000000036a5c] eeh_dev_check_failure+0x22c/0x560
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236715] [c000003c9923b790] [c000000000036e14] eeh_check_failure+0x84/0xe0
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236737] [c000003c9923b7d0] [d00000001c7854a0] bnx2x_get_ext_phy_fw_version+0x1e0/0x220 [bnx2x]
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236746] [c000003c9923b830] [d00000001c794c34] bnx2x_fill_fw_str+0x64/0x140 [bnx2x]
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236754] [c000003c9923b8e0] [d00000001c79f2ac] bnx2x_get_drvinfo+0x6c/0x100 [bnx2x]
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236761] [c000003c9923b910] [d00000001e34f9b0] netdevice_event+0xc0/0x350 [ib_core]
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236765] [c000003c9923ba90] [c0000000000dbce8] notifier_call_chain+0x98/0x100
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236767] [c000003c9923bae0] [c0000000008b796c] call_netdevice_notifiers_info+0x5c/0xb0
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236770] [c000003c9923bb60] [c0000000008bde48] netdev_state_change+0x48/0x80
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236772] [c000003c9923bba0] [c0000000008db014] linkwatch_do_dev+0x74/0xd0
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236773] [c000003c9923bbd0] [c0000000008db54c] __linkwatch_run_queue+0x14c/0x270
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236775] [c000003c9923bc40] [c0000000008db6b4] linkwatch_event+0x44/0x60
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236778] [c000003c9923bc60] [c0000000000d291c] process_one_work+0x19c/0x480
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236780] [c000003c9923bcf0] [c0000000000d31c0] worker_thread+0x190/0x5b0
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236782] [c000003c9923bd80] [c0000000000da4f4] kthread+0x114/0x140
Jul 27 01:09:54 rcx2c309 kernel: [ 68.236785] [c000003c9923be30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
Jul 27 01:09:56 rcx2c309 kernel: [ 70.038711] pnv_ioda_unfreeze_pe: Failure -6 clear 1 on PHB#1-PE#1
Jul 27 01:09:56 rcx2c309 kernel: [ 70.038713] eeh_pci_enable: Unexpected state change 2 on PHB#1-PE#1, err=-5
Jul 27 01:09:56 rcx2c309 kernel: [ 70.038937] pnv_ioda_unfreeze_pe: Failure -6 clear 2 on PHB#1-PE#1
Jul 27 01:09:56 rcx2c309 kernel: [ 70.038938] eeh_pci_enable: Unexpected state change 3 on PHB#1-PE#1, err=-5
Jul 27 01:09:56 rcx2c309 kernel: [ 70.038940] EEH: Notify device drivers the completion of reset
Jul 27 01:09:56 rcx2c309 kernel: [ 70.038943] bnx2x: [bnx2x_io_slot_reset:13737(eth0)]IO slot reset initializing...
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039706] EEH: Frozen PHB#1-PE#1 detected
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039733] EEH: PE location: N/A, PHB location: N/A
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039767] CPU: 9 PID: 812 Comm: eehd Tainted: G OE 3.19.0-23-generic #24~14.04.1-Ubuntu
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039768] Call Trace:
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039770] [c000003ca1e6f840] [c000000000a26690] dump_stack+0x90/0xbc (unreliable)
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039772] [c000003ca1e6f870] [c000000000036d74] eeh_dev_check_failure+0x544/0x560
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039775] [c000003ca1e6f910] [c000000000076c9c] pnv_pci_read_config+0x13c/0x1a0
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039778] [c000003ca1e6f960] [c000000000561204] pci_bus_read_config_word+0xc4/0x110
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039781] [c000003ca1e6f9c0] [c00000000056f574] pci_enable_device_flags+0x174/0x1a0
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039790] [c000003ca1e6fa10] [d00000001c761dc4] bnx2x_io_slot_reset+0x94/0x570 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039792] [c000003ca1e6fad0] [c00000000003ab04] eeh_report_reset+0x104/0x140
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039793] [c000003ca1e6fb10] [c0000000000395c8] eeh_pe_dev_traverse+0x98/0x170
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039795] [c000003ca1e6fba0] [c00000000003b584] eeh_handle_normal_event+0x334/0x410
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039797] [c000003ca1e6fc20] [c00000000003b968] eeh_handle_event+0x188/0x340
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039799] [c000003ca1e6fcd0] [c00000000003bce8] eeh_event_handler+0x1c8/0x1d0
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039801] [c000003ca1e6fd80] [c0000000000da4f4] kthread+0x114/0x140
Jul 27 01:09:56 rcx2c309 kernel: [ 70.039803] [c000003ca1e6fe30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
Jul 27 01:09:56 rcx2c309 kernel: [ 70.054577] pci_raw_set_power_state: 33 callbacks suppressed
Jul 27 01:09:56 rcx2c309 kernel: [ 70.054580] bnx2x 0001:01:00.0: Refused to change power state, currently in D3
Jul 27 01:09:56 rcx2c309 kernel: [ 70.114605] bnx2x: [bnx2x_io_slot_reset:13797(eth0)]pci_cleanup_aer_uncorrect_error_status failed
Jul 27 01:09:56 rcx2c309 kernel: [ 70.114817] bnx2x: [bnx2x_io_slot_reset:13737(eth1)]IO slot reset initializing...
Jul 27 01:09:56 rcx2c309 kernel: [ 70.130577] bnx2x 0001:01:00.1: Refused to change power state, currently in D3
Jul 27 01:09:56 rcx2c309 kernel: [ 70.214576] bnx2x: [bnx2x_io_slot_reset:13753(eth1)]IO slot reset --> driver unload
Jul 27 01:09:56 rcx2c309 kernel: [ 70.214790] Unable to handle kernel paging request for data at address 0xd0000801827fffff
Jul 27 01:09:56 rcx2c309 kernel: [ 70.214965] Faulting instruction address: 0xd00000001c742a70
Jul 27 01:09:56 rcx2c309 kernel: [ 70.215007] Oops: Kernel access of bad area, sig: 11 [#1]
Jul 27 01:09:56 rcx2c309 kernel: [ 70.215039] SMP NR_CPUS=2048 NUMA PowerNV
Jul 27 01:09:56 rcx2c309 kernel: [ 70.215074] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables x_tables ast ttm joydev mac_hid hid_generic usbhid at24 ipmi_powernv powernv_rng ipmi_msghandler uio_pdrv_genirq drm_kms_helper uio hid drm syscopyarea sysfillrect sysimgblt i2c_algo_bit nfsd auth_rpcgss nfs_acl nfs lockd knem(OE) grace sunrpc fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel mlx4_core(OE) mlx_compat(OE) uas usb_storage bnx2x ahci libahci mdio libcrc32c
Jul 27 01:09:56 rcx2c309 kernel: [ 70.215777] CPU: 9 PID: 812 Comm: eehd Tainted: G OE 3.19.0-23-generic #24~14.04.1-Ubuntu
Jul 27 01:09:56 rcx2c309 kernel: [ 70.215834] task: c000003ca0139100 ti: c000003ca1e6c000 task.ti: c000003ca1e6c000
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216017] NIP: d00000001c742a70 LR: d00000001c742a50 CTR: c000000000036d90
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216066] REGS: c000003ca1e6f710 TRAP: 0300 Tainted: G OE (3.19.0-23-generic)
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216122] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008084 XER: 00000000
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] CFAR: c000000000036e24 DAR: d0000801827fffff DSISR: 40000000 SOFTE: 1
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR00: d00000001c742a50 c000003ca1e6f990 d00000001c809348 d0000801827fffff
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR04: 0000000000000001 c000003ca1e6f970 9000000100009033 0000000000000001
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR08: 0000000000000000 0000000000000000 0000000000000000 d00000001c7d2030
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR12: 0000000000008800 c00000000fb85100 c0000000000da3e8 c000001fe2931980
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c51108
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR24: c000000000c510e0 0000000000100100 c000001fe25d0000 c000001fe25d0000
Jul 27 01:09:56 rcx2c309 kernel: [ 70.216246] GPR28: ffffffffffffffff 0000000000000033 00000000ffffffff c000001fe198c900
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217034] NIP [d00000001c742a70] bnx2x_init_shmem+0x180/0x1f0 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217081] LR [d00000001c742a50] bnx2x_init_shmem+0x160/0x1f0 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217122] Call Trace:
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217145] [c000003ca1e6f990] [d00000001c742a50] bnx2x_init_shmem+0x160/0x1f0 [bnx2x] (unreliable)
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217217] [c000003ca1e6fa10] [d00000001c761f48] bnx2x_io_slot_reset+0x218/0x570 [bnx2x]
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217274] [c000003ca1e6fad0] [c00000000003ab04] eeh_report_reset+0x104/0x140
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217331] [c000003ca1e6fb10] [c0000000000395c8] eeh_pe_dev_traverse+0x98/0x170
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217389] [c000003ca1e6fba0] [c00000000003b584] eeh_handle_normal_event+0x334/0x410
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217445] [c000003ca1e6fc20] [c00000000003b968] eeh_handle_event+0x188/0x340
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217502] [c000003ca1e6fcd0] [c00000000003bce8] eeh_event_handler+0x1c8/0x1d0
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217558] [c000003ca1e6fd80] [c0000000000da4f4] kthread+0x114/0x140
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217608] [c000003ca1e6fe30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217798] Instruction dump:
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217825] 40820014 792a07e1 4182000c 4808f5e5 e8410018 893f0033 e87f0020 939f0928
Jul 27 01:09:56 rcx2c309 kernel: [ 70.217917] 79291768 7fde4a14 7c63f214 7c0004ac <81230000> 0c090000 4c00012c 2f89ffff
Jul 27 01:09:56 rcx2c309 kernel: [ 70.218009] ---[ end trace 8d49f86574f73f94 ]---
Jul 27 01:09:56 rcx2c309 kernel: [ 70.218041]
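The syslog excerpt above interleaves EEH core messages with bnx2x driver output and register dumps. As a rough aid for following the recovery sequence, here is a minimal Python sketch that keeps only the kernel lines mentioning EEH. The function name `eeh_events`, the regular expression, and the three-line sample are illustrative assumptions, not part of any EEH tooling; the sample lines are copied from the log above.

```python
import re

# Matches syslog-style kernel lines such as:
#   "Jul 27 01:09:54 host kernel: [ 68.122649] EEH: Frozen PE#1 on PHB#1 detected"
# Assumption: every line carries the "kernel: [ <seconds> ]" prefix seen in this report.
KERNEL_LINE = re.compile(r"kernel: \[\s*(?P<ts>\d+\.\d+)\]\s*(?P<msg>.*)$")


def eeh_events(lines):
    """Yield (timestamp_seconds, message) for kernel lines that mention EEH."""
    for line in lines:
        match = KERNEL_LINE.search(line)
        if match is None:
            continue
        msg = match.group("msg")
        # "EEH" covers the EEH core messages; "eeh_" covers function-prefixed
        # messages such as "eeh_pci_enable: ..." (and stack frames naming eeh_* functions).
        if "EEH" in msg or "eeh_" in msg:
            yield float(match.group("ts")), msg


# Hypothetical usage on three lines taken from the log in this report.
sample = [
    "Jul 27 01:09:54 rcx2c309 kernel: [ 68.122649] EEH: Frozen PE#1 on PHB#1 detected",
    "Jul 27 01:09:54 rcx2c309 kernel: [ 68.123545] bnx2x: [bnx2x_io_error_detected:13702(eth0)]IO error detected",
    "Jul 27 01:09:56 rcx2c309 kernel: [ 70.038713] eeh_pci_enable: Unexpected state change 2 on PHB#1-PE#1, err=-5",
]
events = list(eeh_events(sample))
for ts, msg in events:
    print(f"{ts:10.6f}  {msg}")
```

Note that call-trace frames naming `eeh_*` functions also pass the filter, so a full log yields the stack entries alongside the state-change messages.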
Userspace tool common name: EEH
The userspace tool has the following bit modes: ppc64le
Userspace rpm: EEH
Userspace tool obtained from project website: na
*Additional Instructions for mputtash@xxxxxxxxxx:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach ltrace and strace of userspace application.
== Comment: #8 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-06 21:01:21 ==
Manvanthara, please reach me through Sametime with the machine access info so that I can debug it and come up with a patch to fix it, thanks!
== Comment: #10 - Mukesh K. Ojha <mukeojha@xxxxxxxxxx> - 2015-08-18 04:50:19 ==
Hi All,
Any update on this issue?
== Comment: #13 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 20:38:18 ==
Actually, Manvanthara is reporting two different issues in comment #0 and comment #7. I'm looking at the problem reported in comment #7, which can be reproduced with 4.2.rc8 (upstream kernel). I think we might open another bug to track the issue from comment #7 and let this bug track the issue from comment #0, if Manvanthara agrees, as they're different issues from my perspective, thanks!
== Comment: #14 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 22:03:16 ==
One patch was sent to the community for review; it is tracked at the following link. I also installed a private kernel built from 4.2.rc8 plus the patch, and with it the EEH error is recovered successfully. The kernel can be selected from the petitboot menu entry "Ubuntu, with Linux 4.2.0-rc8gavin+" in case anybody wants to give it a try, thanks!
https://patchwork.ozlabs.org/patch/511744/ ("powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail()")
== Comment: #15 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 23:42:16 ==
Please ignore the part about "they're different issues" in comment 13. It should be corrected to: they are the same issue. So we don't need to open another bug at all. Sorry for the confusion :-)
== Comment: #16 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2015-08-27 23:43:36 ==
I was told by Michael Ellerman the patch will be put into 4.3.rc3. Closing it as "fixed".