kernel-packages team mailing list archive

[Bug 1417580] [NEW] HP Proliant Servers should use proper cmdline to avoid kernel panics

 

Public bug reported:

This bug consolidates the HP Proliant bugs related to kernel panics.
Please do not attach cores or other files here; use it only to provide
feedback on the cmdline options and their explanations.

Over the last 2 weeks we had several talks with the HP ROM Engineering
Team regarding NMIs (non-maskable interrupts) being generated in some
situations:

- NMIs caused during MWAIT instruction (caused by intel_idle module): 
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1318551)

HP relies heavily on ACPI for its power management features. HP is one
of the most active members of the ACPI specification group, and several
features of their servers, available through their firmware, are heavily
ACPI dependent. While working on this and other bugs we discovered that
the intel_idle module does not use the ACPI tables (the firmware's way
of telling the OS which p-state/c-state values are available) but
queries the processor directly for the available c-states. This is
probably leading the OS to enter a c-state (or sub-state) that the
firmware is not "prepared" to handle. We have provided the following
cmdline to be used: " intel_idle.max_cstate=0 ". This tells the OS to
deactivate intel_idle and activate the acpi_idle module, which takes the
c-state values to use from the ACPI tables provided by the firmware. HP
is trying to figure out what is generating the NMIs with intel_idle, but
it may end up recommending that all HP servers deactivate the intel_idle
module (in the near future).
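
To report back whether a machine actually booted with this option, the
check can be scripted. The following is only a minimal sketch in Python:
the function names are illustrative (not part of any Ubuntu tool), and
the parser assumes parameters are separated by whitespace with no quoted
values.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' map to None. Quoted values containing
    spaces are not handled (simplifying assumption).
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def intel_idle_disabled(cmdline):
    """Return True if intel_idle.max_cstate=0 is on the command line."""
    return parse_cmdline(cmdline).get("intel_idle.max_cstate") == "0"
```

On a running system the input would be the contents of /proc/cmdline,
which holds the command line the current kernel was booted with.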

- Recently discovered NMIs caused by a BUG in Intel microcode 
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1416414) 

I've discovered a recent problem in some Intel Ivy Bridge microcode: a
specific bit in the PMU (performance counter) register is not being
cleared. This can lead to an NMI being wrongly handled (as if the PMU
register had overflowed when it had not) and to a kernel panic. We have
backported the fix to Ubuntu-3.13.0-35.61, so it is strongly advised
that all Ubuntu Trusty servers running the Xeon® Processor E7 v2 be
upgraded to "at least" kernel 3.13.0-35. The cmdline " nmi_watchdog=0 "
can be used to disable the regular x86 watchdog and use the HP one, but
HP does not recommend this as the "default" cmdline. Canonical advised
HP about the Intel Errata # and that the recommended workaround is a
firmware fix. Canonical has provided a kernel patch to work around the
issue on non-patched firmware (the fixed firmware is probably yet to be
released by HP).
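
When giving feedback, it helps to state whether the running kernel is
at or above the fixed version. A rough sketch of that comparison in
Python follows. It assumes the usual Ubuntu release string layout (for
example "3.13.0-35-generic"), compares only the version and ABI number
(ignoring the upload number ".61"), and is only meaningful for the
Trusty 3.13 series; the function names are illustrative.

```python
import re

# Minimum Trusty kernel carrying the backported fix, per this report.
MINIMUM_FIXED = (3, 13, 0, 35)


def kernel_abi(release):
    """Parse a release like '3.13.0-35-generic' into (3, 13, 0, 35)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(group) for group in match.groups())


def has_pmu_nmi_fix(release):
    """Return True if the release is at least 3.13.0-35."""
    return kernel_abi(release) >= MINIMUM_FIXED
```

The release string of the running kernel is what "uname -r" prints, or
platform.release() from within Python.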

- X2APIC support for HP Proliant Servers 
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497)

During this investigation we had to clarify with the HP ROM Engineering
Team whether these servers support X2APIC or not. HP told us that all
Gen8 (and more recent generations) do support X2APIC, but they still
"ask" the OS to opt out of X2APIC (not to use X2APIC). Running the PIC
(programmable interrupt controller) in XAPIC mode might not be
compatible with the firmware if the CPU supports X2APIC, because of one
of the few features that differ between XAPIC and X2APIC: IRQ remapping
(basically, for virtualization). It is therefore recommended that all HP
Proliant Gen8 or newer servers use the following cmdline: "
intremap=no_x2apic_optout ".
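
For anyone applying these options to several machines, the sketch below
shows one way to add the two generally recommended parameters to an
existing cmdline string without duplicating them. It is an illustration
only: the helper name and the RECOMMENDED list are hypothetical, and
nmi_watchdog=0 is left out on purpose because it is not recommended as
a default. A parameter is appended only when that exact token is
absent, so a conflicting earlier value is left in place.

```python
# Parameters recommended in this report for Gen8 or newer servers.
RECOMMENDED = ["intel_idle.max_cstate=0", "intremap=no_x2apic_optout"]


def merge_params(existing, wanted):
    """Append each parameter in `wanted` to `existing` if it is missing.

    Existing parameters keep their order; nothing is removed.
    """
    tokens = existing.split()
    for param in wanted:
        if param not in tokens:
            tokens.append(param)
    return " ".join(tokens)
```

On Ubuntu the resulting string would normally be placed in
GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by running
update-grub and rebooting.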

Anyone affected, please provide feedback in this bug on the use of these
cmdlines (including the kernel version), and tell me whether new kernel
panics (regarding NMIs and/or APIC) have happened on this server family.
We are getting feedback from the community that these options are enough
to keep the Proliant server family from panicking, and they might be
released soon as a "public recommendation" for HP HW compatibility.

Thank you

Rafael Tinoco

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Rafael David Tinoco (inaddy)
         Status: Confirmed


** Tags: cts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1417580

Title:
  HP Proliant Servers should use proper cmdline to avoid kernel panics

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580/+subscriptions

