kernel-packages team mailing list archive

[Bug 1502982] [NEW] STCOP810:Firestone: frsfp6 EEH on Bluefin does not recover with Ubuntu

 

You have been subscribed to a public bug:

Problem:
==========
Test Case Execution Record:
	
95613: EEH_Firestone_Ubuntu 14.04.03_Bluefin_Standalone on frsfp6

Error Injection Method: err_injct_inboundA

Step 1. Start HTX (I used mdt.hdbuster & only ran htx on bluefin disks)
Step 2. Inject EEH error

bluefin is in slot P1-C4 (PCI0004)

 echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0004/err_injct_inboundA; sleep 1; echo 0x0 > /sys/kernel/debug/powerpc/PCI0004/err_injct_inboundA
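
For repeated runs, the injection step above could be wrapped in a small
helper. This is only a sketch: the function name inject_eeh_inbound_a and
the EEH_DEBUGFS_ROOT variable are made up for illustration, and the
debugfs path and the two values written are the ones used in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): repeats the manual
# injection from this report for a given PHB directory.
# EEH_DEBUGFS_ROOT is an assumed override so the path can be changed.
EEH_DEBUGFS_ROOT="${EEH_DEBUGFS_ROOT:-/sys/kernel/debug/powerpc}"

inject_eeh_inbound_a() {
    phb="$1"    # PHB directory name, e.g. PCI0004
    node="$EEH_DEBUGFS_ROOT/$phb/err_injct_inboundA"

    # Refuse to continue if the debugfs node is missing or not writable.
    if [ ! -w "$node" ]; then
        echo "cannot write $node (is debugfs mounted, are you root?)" >&2
        return 1
    fi

    echo 0x8000000000000000 > "$node"   # arm the error injection
    sleep 1
    echo 0x0 > "$node"                  # clear it again
}
```

Usage on the system from this report would be: inject_eeh_inbound_a PCI0004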

Expected Result: Adapter/SAN disks to recover and htx still run

Actual Result: Adapter did not recover. EEH errors recurred continuously
until the limit of 6 in 1 hour was reached

There are two patches: one for the skiboot firmware, and a Linux kernel
patch that is already upstream but is missing from the Ubuntu distro (at
least 15.04). The skiboot patch has been merged upstream.

c7192a4 PHB3: Fix wrong PE number in error injection (skiboot)
2aa5cf9 powerpc/eeh: Fix missed PE#0 on P7IOC         (linux)

If I'm correct, this bug needs to be mirrored so that the Linux patch
(commit 2aa5cf9) can be backported to the Ubuntu distro. With the patch
backported to Ubuntu 15.04, EEH works fine on a Broadcom adapter (not
exactly the one on which the bug was initially reported):

root@fstn2-p1:/# dmesg | grep EEH
[    0.216919] EEH: PowerNV platform initialized
[    0.570606] EEH: devices created
[    1.302482] EEH: PCI Enhanced I/O Error Handling Enabled
[   90.566761] EEH: PHB location: Slot1
[   90.567503] EEH: Frozen PHB#4-PE#0 detected
[   90.567673] EEH: PE location: Slot1, PHB location: Slot1
[   90.567930] EEH: Detected PCI bus error on PHB#4-PE#0
[   90.567935] EEH: This PCI device has failed 1 times in the last hour
[   90.567937] EEH: Notify device drivers to shutdown
[   90.567985] EEH: Collect temporary log
[   90.568971] EEH: Reset without hotplug activity
[   94.585540] EEH: Notify device drivers the completion of reset
[   94.585934] EEH: Notify device driver to resume

----

The story behind this bug is: without commit 2aa5cf9 ("powerpc/eeh: Fix
missed PE#0 on P7IOC"), PE#0 is regarded as an invalid PE. When the
kernel sees the frozen PE#0, it clears the frozen state, dumps the PHB
diag-data, and then tries to recover the PE. While the PE is being
reset, the driver, which was not completely stopped by error_detected(),
accesses the MMIO space and causes another (recursive) EEH error.
Eventually, the EEH recovery fails. During the PE reset, the I/O path
for the PE should be frozen, and MMIO accesses during that period should
be dropped to avoid recursive EEH errors.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-131243 severity-high targetmilestone-inin14043
-- 
STCOP810:Firestone: frsfp6 EEH on Bluefin does not recover with Ubuntu
https://bugs.launchpad.net/bugs/1502982
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.