group.of.nepali.translators team mailing list archive
Message #06224
[Bug 1603449] Re: [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Froze PE Error injection
Guo Wen Shan - how about if you take a stab at the backports for these 2
patches, 'cause I don't think they make sense for a 4.4 kernel.
** Also affects: linux (Ubuntu Xenial)
Importance: Undecided
Status: New
** Changed in: linux (Ubuntu Xenial)
Status: New => In Progress
** Changed in: linux (Ubuntu Xenial)
Assignee: (unassigned) => Tim Gardner (timg-tpi)
--
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1603449
Title:
[LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad
area, sig: 11 [#1] while executing Froze PE Error injection
Status in linux package in Ubuntu:
Triaged
Status in linux source package in Xenial:
In Progress
Bug description:
== Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsubra@xxxxxxxxxx> - 2016-07-13 01:28:56 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]
---uname output---
Linux ltc-garri2 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea
Machine Type = P8
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Install Ubuntu 16.04.1 on P8 OpenPOWER 8335-GTB hardware.
Then execute the Frozen PE error injection tests as shown below:
root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
0004:00:00.0 0604: 1014:03dc
0
The kernel immediately crashes with an Oops message.
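For readers unfamiliar with the string written to err_injct above, here is a minimal sketch of how it might be split into fields. It assumes the five colon-separated hex values are read in the order pe_no:type:func:addr:mask, which is an assumption based on a reading of the powernv EEH debugfs handler and is not stated in this report. The names ErrInject and parse_err_injct are hypothetical, and the meaning of each func value is deliberately left uninterpreted.

```python
from typing import NamedTuple


class ErrInject(NamedTuple):
    """Fields of an err_injct request (field order is an assumption)."""
    pe_no: int     # PE number on the PHB
    err_type: int  # error type selector
    func: int      # error function selector
    addr: int      # address operand
    mask: int      # address mask operand


def parse_err_injct(text: str) -> ErrInject:
    """Split 'a:b:c:d:e' into five integers, treating each field as hex."""
    parts = text.strip().split(":")
    if len(parts) != 5:
        raise ValueError("expected 5 colon-separated fields, got %d" % len(parts))
    return ErrInject(*(int(p, 16) for p in parts))


# The string used in the reproduction steps of this report.
request = parse_err_injct("0:0:4:0:0")
print(request)
```

Under that assumption, the request in this report targets PE#0 with function selector 4 and zero address and mask, which is consistent with the "Frozen PE#0 on PHB#4" line in the oops output.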
Contact Information = pavsubra@xxxxxxxxxx
Stack trace output:
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
Oops output:
[ 289.294622] EEH: Frozen PE#0 on PHB#4 detected
[ 289.294785] EEH: PE location: N/A, PHB location: N/A
[ 289.295598] EEH: This PCI device has failed 1 times in the last hour
[ 289.295600] EEH: Notify device drivers to shutdown
[ 289.295605] EEH: Collect temporary log
[ 289.295632] EEH: of node=0004:00:00:0
[ 289.295635] EEH: PCI device/vendor: 03dc1014
[ 289.295638] EEH: PCI cmd/status register: 00100106
[ 289.295641] EEH: Bridge secondary status: 0000
[ 289.295644] EEH: Bridge control: 0002
[ 289.295645] EEH: PCI-E capabilities and status follow:
[ 289.295654] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 289.295661] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 289.295664] EEH: PCI-E 20: 00000000
[ 289.295665] EEH: PCI-E AER capability register set follows:
[ 289.295674] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 289.295680] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 289.295687] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 289.295690] EEH: PCI-E AER 30: 00000000 00000000
[ 289.295693] PHB3 PHB#4 Diag-data (Version: 1)
[ 289.295695] brdgCtl: 00000002
[ 289.295697] UtlSts: 00080000 00000000 00000000
[ 289.295699] RootSts: 00000040 00000000 01010008 00100102 00000000
[ 289.295701] PhbSts: 0000001c00000000 0000001c00000000
[ 289.295704] Lem: 0000000000100000 42498e367f502eae 0000000000000000
[ 289.295706] InAErr: 4000000000000000 4000000000000000 0202000000000000 0000000000000000
[ 289.295708] PE[ 0] A/B: 8440002b00000000 8000000000000000
[ 289.295711] EEH: Reset with hotplug activity
[ 289.295726] pci_bus 0004:01: busn_res: [bus 01] is released
[ 289.295868] Unable to handle kernel paging request for data at address 0x00000010
[ 289.295937] Faulting instruction address: 0xc000000000083c7c
[ 289.295997] Oops: Kernel access of bad area, sig: 11 [#1]
[ 289.296043] SMP NR_CPUS=2048 NUMA PowerNV
[ 289.296098] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf input_leds joydev mac_hid hid_generic usbhid hid nvidia(POE) opal_prd ofpart cmdlinepart ibmpowernv at24 powernv_flash uio_pdrv_genirq ipmi_powernv mtd ipmi_msghandler powernv_rng uio autofs4 uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
[ 289.296657] CPU: 1 PID: 651 Comm: eehd Tainted: P OE 4.4.0-30-generic #49-Ubuntu
[ 289.296726] task: c000000feeb02a20 ti: c000000feeb88000 task.ti: c000000feeb88000
[ 289.296787] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 289.296848] REGS: c000000feeb8b760 TRAP: 0300 Tainted: P OE (4.4.0-30-generic)
[ 289.296915] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008822 XER: 00000000
[ 289.297065] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
GPR00: c000000000083c78 c000000feeb8b9e0 c0000000015b5d00 0000000000000000
GPR04: 0000000000000001 c000000feeb8bac0 c000001e4e693540 0000000000000ff7
GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001c
GPR12: c000000000083c20 c000000007b20980 c0000000000e6318 c000001e4e7a0340
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
GPR28: c00000000161a3f0 0000000000000001 c000001fff764480 c000001e4e744000
[ 289.297867] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 289.297907] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 289.298731] ---[ end trace 393da961db41eff1 ]---
[ 289.452447]
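As a rough cross-check of the oops above (a sketch, not part of the original report), the marked word <e9230010> in the instruction dump can be decoded by hand as a PowerPC DS-form instruction. The helper name decode_ds_form is hypothetical. The register values are copied from the dump, assuming the GPR00 row lists r0 through r3 in order and that CTR still holds the entry address of pnv_eeh_reset.

```python
def decode_ds_form(word: int) -> dict:
    """Split a 32-bit PowerPC DS-form instruction word into its fields."""
    disp = word & 0xFFFC          # low two bits are the extended opcode
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode (58 = ld/ldu/lwa)
        "rt": (word >> 21) & 0x1F,      # target register
        "ra": (word >> 16) & 0x1F,      # base register
        "disp": disp,                   # byte displacement
        "xo": word & 0x3,               # 0 selects plain ld under opcode 58
    }


# Values copied from the oops in this report.
FAULTING_WORD = 0xE9230010
NIP = 0xC000000000083C7C
CTR = 0xC000000000083C20   # assumed to be the function entry address
DAR = 0x0000000000000010
GPR3 = 0x0000000000000000  # fourth value on the GPR00 row

fields = decode_ds_form(FAULTING_WORD)
effective_address = GPR3 + fields["disp"]

print(fields)
print("offset into function: 0x%x" % (NIP - CTR))
print("effective address:    0x%x (DAR is 0x%x)" % (effective_address, DAR))
```

If this reading is right, the faulting instruction is `ld r9,16(r3)` with r3 equal to zero, that is, a load from offset 0x10 of a NULL pointer at pnv_eeh_reset+0x5c, which matches the reported DAR of 0x10.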
System Dump Info:
The system is not configured to capture a system dump.
*Additional Instructions for pavsubra@xxxxxxxxxx:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.
== Comment: #2 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2016-07-15 09:42:09 ==
The following two patches are needed:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")