group.of.nepali.translators team mailing list archive

[Bug 1603449] Re: [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Froze PE Error injection

 

Guo Wen Shan - how about if you take a stab at the backports for these 2
patches, 'cause I don't think they make sense for a 4.4 kernel.

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Xenial)
       Status: New => In Progress

** Changed in: linux (Ubuntu Xenial)
     Assignee: (unassigned) => Tim Gardner (timg-tpi)

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1603449

Title:
  [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad
  area, sig: 11 [#1] while executing Froze PE Error injection

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  In Progress

Bug description:
  == Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsubra@xxxxxxxxxx> - 2016-07-13 01:28:56 ==
  ---Problem Description---
  Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]
   
  ---uname output---
  Linux ltc-garri2 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  root@ltc-garri2:~# lspci
  0000:00:00.0 PCI bridge: IBM Device 03dc
  0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  0001:00:00.0 PCI bridge: IBM Device 03dc
  0002:00:00.0 PCI bridge: IBM Device 03dc
  0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
  0003:00:00.0 PCI bridge: IBM Device 03dc
  0004:00:00.0 PCI bridge: IBM Device 03dc
  0005:00:00.0 PCI bridge: IBM Device 03dc
  0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
  0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
  0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
  0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
  0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
  0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
  0006:00:00.0 PCI bridge: IBM Device 03dc
  0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
  0007:00:00.0 PCI bridge: IBM Device 03dc
  0008:00:00.0 Bridge: IBM Device 04ea
  0008:00:00.1 Bridge: IBM Device 04ea
  0008:00:01.0 Bridge: IBM Device 04ea
  0008:00:01.1 Bridge: IBM Device 04ea
  0009:00:00.0 Bridge: IBM Device 04ea
  0009:00:00.1 Bridge: IBM Device 04ea
  0009:00:01.0 Bridge: IBM Device 04ea
  0009:00:01.1 Bridge: IBM Device 04ea
   

   
  Machine Type = P8 
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
  Install Ubuntu 16.04.1 on a P8 OpenPOWER 8335-GTB machine.
  Then run the Frozen PE error injection test as shown below:

  root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
  0004:00:00.0 PCI bridge: IBM Device 03dc
  root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
  eeh_slot_resets=0

  
  root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
  0004:00:00.0 0604: 1014:03dc
  0

  The kernel crashes immediately with an Oops message.
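
The reproduction steps above can be sketched as two small parsing helpers, for anyone scripting the test. This is only an illustration: the field names `pe_no`, `err_type`, `func`, `addr` and `mask` are assumptions about how the five colon-separated hexadecimal values written to `err_injct` are interpreted, and are not stated in this report. Both function names are hypothetical.

```python
from typing import NamedTuple


class ErrInject(NamedTuple):
    # Field names are assumptions; the report only shows the raw string.
    pe_no: int
    err_type: int
    func: int
    addr: int
    mask: int


def parse_err_inject(spec: str) -> ErrInject:
    """Split an 'a:b:c:d:e' injection string into five hexadecimal integers."""
    parts = spec.strip().split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return ErrInject(*(int(p, 16) for p in parts))


def parse_slot_resets(eeh_text: str) -> int:
    """Return N from the 'eeh_slot_resets=N' line of /proc/powerpc/eeh output."""
    for line in reversed(eeh_text.splitlines()):
        key, sep, value = line.strip().partition("=")
        if sep and key == "eeh_slot_resets":
            return int(value)
    raise ValueError("eeh_slot_resets not found")


if __name__ == "__main__":
    # Values copied from the shell session in this report.
    print(parse_err_inject("0:0:4:0:0"))
    print(parse_slot_resets("eeh_slot_resets=0\n"))
```

Comparing the counter before and after the injection would show whether a slot reset completed. In this report the machine crashes before the second reading can be taken.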
   
  Contact Information = pavsubra@xxxxxxxxxx 
   
  Stack trace output:
   [  289.297946] Call Trace:
  [  289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
  [  289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
  [  289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
  [  289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
  [  289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
  [  289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
  [  289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
  [  289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
  [  289.298501] Instruction dump:
  [  289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
  [  289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010

   
  Oops output:
   [  289.294622] EEH: Frozen PE#0 on PHB#4 detected
  [  289.294785] EEH: PE location: N/A, PHB location: N/A
  [  289.295598] EEH: This PCI device has failed 1 times in the last hour
  [  289.295600] EEH: Notify device drivers to shutdown
  [  289.295605] EEH: Collect temporary log
  [  289.295632] EEH: of node=0004:00:00:0
  [  289.295635] EEH: PCI device/vendor: 03dc1014
  [  289.295638] EEH: PCI cmd/status register: 00100106
  [  289.295641] EEH: Bridge secondary status: 0000
  [  289.295644] EEH: Bridge control: 0002
  [  289.295645] EEH: PCI-E capabilities and status follow:
  [  289.295654] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
  [  289.295661] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
  [  289.295664] EEH: PCI-E 20: 00000000
  [  289.295665] EEH: PCI-E AER capability register set follows:
  [  289.295674] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
  [  289.295680] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
  [  289.295687] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
  [  289.295690] EEH: PCI-E AER 30: 00000000 00000000
  [  289.295693] PHB3 PHB#4 Diag-data (Version: 1)
  [  289.295695] brdgCtl:     00000002
  [  289.295697] UtlSts:      00080000 00000000 00000000
  [  289.295699] RootSts:     00000040 00000000 01010008 00100102 00000000
  [  289.295701] PhbSts:      0000001c00000000 0000001c00000000
  [  289.295704] Lem:         0000000000100000 42498e367f502eae 0000000000000000
  [  289.295706] InAErr:      4000000000000000 4000000000000000 0202000000000000 0000000000000000
  [  289.295708] PE[  0] A/B: 8440002b00000000 8000000000000000
  [  289.295711] EEH: Reset with hotplug activity
  [  289.295726] pci_bus 0004:01: busn_res: [bus 01] is released
  [  289.295868] Unable to handle kernel paging request for data at address 0x00000010
  [  289.295937] Faulting instruction address: 0xc000000000083c7c
  [  289.295997] Oops: Kernel access of bad area, sig: 11 [#1]
  [  289.296043] SMP NR_CPUS=2048 NUMA PowerNV
  [  289.296098] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf input_leds joydev mac_hid hid_generic usbhid hid nvidia(POE) opal_prd ofpart cmdlinepart ibmpowernv at24 powernv_flash uio_pdrv_genirq ipmi_powernv mtd ipmi_msghandler powernv_rng uio autofs4 uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
  [  289.296657] CPU: 1 PID: 651 Comm: eehd Tainted: P           OE   4.4.0-30-generic #49-Ubuntu
  [  289.296726] task: c000000feeb02a20 ti: c000000feeb88000 task.ti: c000000feeb88000
  [  289.296787] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
  [  289.296848] REGS: c000000feeb8b760 TRAP: 0300   Tainted: P           OE    (4.4.0-30-generic)
  [  289.296915] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28008822  XER: 00000000
  [  289.297065] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
                 GPR00: c000000000083c78 c000000feeb8b9e0 c0000000015b5d00 0000000000000000
                 GPR04: 0000000000000001 c000000feeb8bac0 c000001e4e693540 0000000000000ff7
                 GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001c
                 GPR12: c000000000083c20 c000000007b20980 c0000000000e6318 c000001e4e7a0340
                 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
                 GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
                 GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
                 GPR28: c00000000161a3f0 0000000000000001 c000001fff764480 c000001e4e744000
  [  289.297867] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
  [  289.297907] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
  [  289.297946] Call Trace:
  [  289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
  [  289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
  [  289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
  [  289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
  [  289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
  [  289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
  [  289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
  [  289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
  [  289.298501] Instruction dump:
  [  289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
  [  289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
  [  289.298731] ---[ end trace 393da961db41eff1 ]---
  [  289.452447]
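
As a cross-check of the Oops output above, the faulting instruction word (the one in angle brackets in the instruction dump) can be decoded by hand. The sketch below assumes the standard Power ISA DS-form layout (primary opcode 58 with a 2-bit extended opcode) and takes the GPR3 and DAR values from the register dump. The helper name `decode_ds_form` is hypothetical.

```python
def decode_ds_form(word: int) -> dict:
    """Decode the fields of a 32-bit Power ISA DS-form instruction word."""
    disp = word & 0xFFFC          # 14-bit DS field with two zero bits appended
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode
        "rt": (word >> 21) & 0x1F,      # target register
        "ra": (word >> 16) & 0x1F,      # base register
        "disp": disp,
        "xo": word & 0x3,               # extended opcode
    }


# Extended-opcode names for primary opcode 58.
OPCODE_58 = {0: "ld", 1: "ldu", 2: "lwa"}

FAULTING_WORD = 0xE9230010  # <e9230010> from the instruction dump
GPR3 = 0x0                  # fourth value on the GPR00 line
DAR = 0x10                  # DAR from the register dump

fields = decode_ds_form(FAULTING_WORD)
mnemonic = OPCODE_58[fields["xo"]]
effective_address = GPR3 + fields["disp"]

print(f"{mnemonic} r{fields['rt']},{fields['disp']}(r{fields['ra']})")  # ld r9,16(r3)
print(hex(effective_address), effective_address == DAR)                 # 0x10 True
```

The load uses r3 as its base register, and r3 is zero, so the access at offset 0x10 is a NULL pointer dereference. This matches the "data at address 0x00000010" message. The word two slots earlier in the dump (`4bfb70c5`) is a branch-and-link, which suggests r3 holds the NULL return value of the call made just before the fault.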

   
  System Dump Info:
    The system is not configured to capture a system dump.
   
  *Additional Instructions for pavsubra@xxxxxxxxxx: 
  -Post a private note with access information to the machine that the bug is occurring on.
  -Attach sysctl -a output to the bug.

  == Comment: #2 - Guo Wen Shan <gwshan@xxxxxxxxxxx> - 2016-07-15 09:42:09 ==
  The following two patches are needed:

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
  ("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
  ("powerpc/eeh: Fix invalid cached PE primary bus")

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1603449/+subscriptions