kernel-packages team mailing list archive
Message #142619
[Bug 1482343] Comment bridged from LTC Bugzilla
------- Comment From mahesh.salgaonkar@xxxxxxxxxx 2015-10-27 15:41 EDT-------
I just verified that the issue is fixed in the Ubuntu-3.19.0-32.37 kernel.
------------------------------------------------------------------------------------
Ubuntu 14.04.3 LTS ltc-fire14 hvc0
ltc-fire14 login: root
Password:
Last login: Tue Oct 27 10:11:22 CDT 2015 on hvc0
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.19.0-32-generic ppc64le)
* Documentation: https://help.ubuntu.com/
root@ltc-fire14:~# uname -a
Linux ltc-fire14 3.19.0-32-generic #37-Ubuntu SMP Wed Oct 21 10:22:35 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-fire14:~# cd /home/workload_scripts/
root@ltc-fire14:/home/workload_scripts# ls
find_work.sh run_workload.sh
root@ltc-fire14:/home/workload_scripts# ./run_workload.sh
root@ltc-fire14:/home/workload_scripts# getscom -l
Chip ID | Rev | Chip type
---------|-------|--------
80000085 | DD2.0 | Centaur memory buffer
80000084 | DD2.0 | Centaur memory buffer
80000005 | DD2.0 | Centaur memory buffer
80000004 | DD2.0 | Centaur memory buffer
00000008 | DD2.0 | P8 (Venice) processor
00000000 | DD2.0 | P8 (Venice) processor
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013100
0
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013106
15a20c688a448b01
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013107
ea5c139705980000
root@ltc-fire14:/home/workload_scripts# putscom -c 0x0 11013107 fa5c139705980000
fa5c139705980000
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013107
fa5c139705980000
root@ltc-fire14:/home/workload_scripts# putscom -c 0x0 11013100 1000000000000000
[ 333.045651] Fatal Hypervisor Maintenance interrupt [Not recovered]
[ 333.045916] Error detail: Malfunction Alert
[ 333.046288] HMER: 8040000000000000
[ 333.046543] CPU PIR: 00000000
[ 333.046601] [Unit: IFU] RegFile core check stop
[ 333.046778] [Unit: PC ] Debug Trigger Error inject
1000000000000008[ 333.046883] F
[194049345926,0] OPAL: Reboot requested due to Platform error.at[194049767279,3] OPAL: Reboot requested due to Platform error.al 1.69405|ERRL|Dumping errors reported prior to registration
3.46924|Ignoring boot flags, incorrect version 0x0
3.70396|ISTEP 6. 3
4.14478|ISTEP 6. 4
4.14531|ISTEP 6. 5
10.54385|HWAS|PRESENT> DIMM[03]=00000000AAAAAAAA
10.54386|HWAS|PRESENT> Membuf[04]=0C0C000000000000
10.54387|HWAS|PRESENT> Proc[05]=C000000000000000
23.49515|ISTEP 6. 6
[...]
------------------------------------------------------------------------------------
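The register values in the session above are easier to compare as bit positions. The following is a minimal sketch, not part of the original report, that lists the set bits of a 64-bit value in IBM (MSB-0) numbering, where bit 0 is the most significant bit. The helper name `set_bits_msb0` is hypothetical, and no meaning is assigned to individual bits here.

```python
def set_bits_msb0(value, width=64):
    """Return the positions of set bits, counting bit 0 as the most significant bit."""
    return [i for i in range(width) if (value >> (width - 1 - i)) & 1]


# Values copied from the console session above.
hmer = 0x8040000000000000            # HMER printed by the kernel
before = 0xEA5C139705980000          # SCOM 11013107 before putscom
after = 0xFA5C139705980000           # SCOM 11013107 after putscom
trigger = 0x1000000000000000         # value written to SCOM 11013100

print(set_bits_msb0(hmer))            # [0, 9]
print(set_bits_msb0(before ^ after))  # [3]
print(set_bits_msb0(trigger))         # [3]
```

The XOR shows that the write to 11013107 changed exactly one bit, and that the value written to 11013100 also has a single bit set.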
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1482343
Title:
Trigger a checkstop on unrecoverable MCE/HMI errors to inform BMC/OCC
about the error.
Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Vivid:
Fix Committed
Status in linux source package in Wily:
Fix Released
Bug description:
The current implementation of the Machine Check handler and the HMI
handler in Linux goes down the kernel panic path for unrecoverable
errors. On FSP-based systems, the FSP is also notified about these
errors and forwards them to PRD (which runs on the FSP) for error
analysis and gard record creation.
On OpenPower (BMC-based systems, e.g. Habanero from TYAN), where PRD
runs in the Linux host, PRD never gets a chance to do error analysis at
the time of the Linux crash, and no gard record is created for such
errors. Since the faulty component never gets de-configured, the system
remains vulnerable to the same HW error.
To fix this issue, a new OPAL call, 'opal_cec_reboot2()', has been
introduced to trigger a checkstop on BMC-based systems and inform
BMC/OCC about the error, so that the BMC can collect relevant data for
error analysis and decide which component to de-configure before
rebooting. The Linux kernel should invoke this OPAL call for
unrecoverable MCE and HMI errors before calling kernel panic, so that
OCC is informed about the error.
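The ordering described above can be modelled in a few lines. This is a simplified user-space sketch, not the kernel patch: `opal_cec_reboot2` is passed in as a stand-in Python callable, and the reboot-type constant, the returned status string, and the function names are all assumptions made for illustration.

```python
REBOOT_PLATFORM_ERROR = 1  # assumed value, for illustration only


def handle_unrecoverable_error(description, opal_cec_reboot2, panic):
    """Model of the proposed flow: request a checkstop first, then panic."""
    status = opal_cec_reboot2(REBOOT_PLATFORM_ERROR, description)
    # On a platform that supports the call, firmware takes over and this
    # point is not expected to be reached. If the call returns, fall back
    # to the existing panic path.
    panic(f"{description} (firmware returned {status})")


# Stub firmware and panic functions that only record what was called.
events = []


def fake_reboot2(reboot_type, message):
    events.append(("reboot2", reboot_type, message))
    return "unsupported"


def fake_panic(message):
    events.append(("panic", message))


handle_unrecoverable_error("Unrecoverable HMI exception", fake_reboot2, fake_panic)
print(events)
```

The stubs make the ordering visible: the firmware call is recorded first, and the panic path runs only because the stub returns.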
The kernel changes have already been posted upstream and are listed
below:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128341.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128342.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132045.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132114.html
The above patches need to be included in Ubuntu 14.04.3+.
We will update this bug with commit IDs once the above patches are
accepted upstream.
Contact Information = mahesh.salgaonkar@xxxxxxxxxx
---uname output---
Linux rcx2d403 3.19.0-26-generic #27 SMP Tue Aug 4 01:38:15 CDT 2015 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
Habanero pass2 system
Machine Type = OpenPower, Habanero
---System Hang---
If the system is hung, it can be recovered by sending IPMI power off/on commands.
$ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power off
$ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power on
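The two commands above can be wrapped in a small script when recovery has to be repeated. This is a sketch under assumptions: the function names are hypothetical, the host and credentials are placeholders supplied by the caller, and only the `ipmitool` options shown in the commands above are used.

```python
import subprocess


def ipmi_power_command(bmc_host, user, password, action):
    """Build the ipmitool argument list for 'power off' or 'power on'."""
    if action not in ("off", "on"):
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-H", bmc_host, "-I", "lanplus",
            "-U", user, "-P", password, "power", action]


def power_cycle(bmc_host, user, password, run=subprocess.run):
    """Send 'power off' followed by 'power on', stopping on the first failure."""
    for action in ("off", "on"):
        run(ipmi_power_command(bmc_host, user, password, action), check=True)
```

Calling `power_cycle("<BMC>", "<user>", "<passwd>")` with real values issues both commands in order. A delay between the two steps may be needed on some systems; none is included here.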
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1482343/+subscriptions