kernel-packages team mailing list archive

[Bug 1482343] Re: Trigger a checkstop on unrecoverable MCE/HMI errors to inform BMC/OCC about the error.

 

This bug was fixed in the package linux - 4.2.0-7.7

---------------
linux (4.2.0-7.7) wily; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1490564
  * rebase to v4.2

  [ Wen Xiong ]

  * SAUCE: ipr: Byte swapping for device_id attribute in sysfs
    - LP: #1453892

  [ Upstream Kernel Changes ]

  * rebase to v4.2
    - LP: #1487345

 -- Tim Gardner <tim.gardner@xxxxxxxxxxxxx>  Wed, 26 Aug 2015 07:06:10 -0600

** Changed in: linux (Ubuntu Wily)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1482343

Title:
  Trigger a checkstop on unrecoverable MCE/HMI errors to inform BMC/OCC
  about the error.

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Wily:
  Fix Released

Bug description:
  The current implementation of the Machine Check handler and the HMI
  handler in Linux goes down the kernel panic path for unrecoverable
  errors. On an FSP-based system, the FSP is also notified about these
  errors and forwards them to PRD (which runs on the FSP) for error
  analysis and gard record creation.

  On OpenPower (a BMC-based system, e.g. Habanero from TYAN), where PRD
  runs in the Linux host, PRD never gets a chance to do error analysis
  at the time of the Linux crash, and no gard record is created for such
  errors. Since the faulty component never gets de-configured, the
  system is vulnerable to being hit by the same HW error again.

  To fix this issue, a new OPAL call, 'opal_cec_reboot2()', has been
  introduced to trigger a checkstop on BMC-based systems to inform the
  BMC/OCC about the error, so that the BMC can collect relevant data for
  error analysis and decide which component to de-configure before
  rebooting. The Linux kernel should invoke this OPAL call for
  unrecoverable MCE and HMI errors before calling kernel panic, so that
  the OCC is informed about the error.
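
  The intended ordering (try the checkstop first, fall back to panic)
  can be sketched in user-space C as below. This is only an
  illustration, not the posted patches: opal_cec_reboot2_stub(),
  firmware_supports_reboot2, handle_unrecoverable_error() and the
  outcome enum are hypothetical names, and the numeric values given to
  the OPAL_* constants are assumptions. On real firmware a successful
  call is not expected to return to the caller at all; the stub returns
  so that the flow can be followed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values, for illustration only. */
enum { OPAL_REBOOT_NORMAL = 0, OPAL_REBOOT_PLATFORM_ERROR = 1 };
enum { OPAL_SUCCESS = 0, OPAL_UNSUPPORTED = -7 };

/* What the handler ended up doing (hypothetical type). */
enum outcome { OUTCOME_CHECKSTOP, OUTCOME_PANIC };

/* Hypothetical switch: set to 0 to model firmware without the new call. */
static int firmware_supports_reboot2 = 1;

/* Hypothetical stand-in for the firmware call named in this bug. */
static int64_t opal_cec_reboot2_stub(uint32_t reboot_type, const char *diag)
{
    if (!firmware_supports_reboot2)
        return OPAL_UNSUPPORTED;
    printf("checkstop requested (type %u): %s\n", (unsigned)reboot_type, diag);
    return OPAL_SUCCESS;
}

/* Common exit path for an unrecoverable MCE or HMI. */
enum outcome handle_unrecoverable_error(const char *msg)
{
    assert(msg != NULL);

    /* Ask firmware for a checkstop first, so BMC/OCC learn of the error. */
    int64_t rc = opal_cec_reboot2_stub(OPAL_REBOOT_PLATFORM_ERROR, msg);
    if (rc == OPAL_SUCCESS)
        return OUTCOME_CHECKSTOP;

    /* Older firmware: keep the previous behaviour and panic. */
    printf("checkstop not available (rc=%lld), panic: %s\n", (long long)rc, msg);
    return OUTCOME_PANIC;
}
```

  Calling handle_unrecoverable_error() with the switch set to 1 takes
  the checkstop branch; setting it to 0 takes the panic fallback, which
  matches the behaviour described at the start of this report.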

  The kernel changes have already been posted upstream and are listed
  below:

  https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128341.html
  https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128342.html
  https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132045.html
  https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132114.html 

  The above patches need to be included in Ubuntu 14.04.3+

  We will update this bug with commit ids, once the above patches are
  accepted upstream.

  Contact Information = mahesh.salgaonkar@xxxxxxxxxx 
   
  ---uname output---
  Linux rcx2d403 3.19.0-26-generic #27 SMP Tue Aug 4 01:38:15 CDT 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  Habanero pass2 system 

   
  Machine Type = OpenPower, Habanero 
   
  ---System Hang---
  If the system is hung, it can be recovered by sending the IPMI power off/on commands:
  $ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power off
  $ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power on

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1482343/+subscriptions