kernel-packages team mailing list archive

[Bug 1432837] Re: HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Rafael David Tinoco <inaddy@xxxxxxxxxx>
Date: Wed, 18 Mar 2015 15:14:03 -0000
Reply-to: Bug 1432837 <1432837@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Sorry, there is a misunderstanding regarding the case and this bug.

This is not the ANSWER for the reported bug, just a clarification of
what the kernel team decided well before this case. All watchdog
modules are blacklisted by default in Ubuntu and can be enabled if
needed (for example, when corosync needs to rely on a HW watchdog to
make sure that no split-brain situation occurs).

Per kernel team comments (on kernel-team mailing list):

"""
We have been seeing random crashes from various HP systems, this has
been tracked to loading of the hpwdt watchdog modules.  Basically these
modules are a loaded gun and unless you know exactly what you are doing
you are likely to take off your own head.  For this reason we already
blacklist "all" of these modules in kmod/module-init-tools blacklists.
Unfortunately these have not been kept in sync with the kernel leading to
the module loading.
"""

So the fix is not a resolution devised for this particular case. It
corrects a bug in how the earlier decision to blacklist all watchdog
modules was applied.
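
To illustrate what "not kept in sync" means here, below is a minimal
Python sketch that compares a modprobe.d-style blacklist with a list of
watchdog modules and reports the ones that are not blacklisted. The
sample file contents, the module list and the function names are
assumptions made for illustration, not the real Ubuntu blacklist or the
kernel team's tooling.

```python
def blacklisted_modules(conf_text):
    """Return the module names listed on 'blacklist <name>' lines."""
    names = set()
    for line in conf_text.splitlines():
        # In this sketch, anything after '#' is treated as a comment.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            # modprobe treats '-' and '_' in module names as equivalent.
            names.add(parts[1].replace("-", "_"))
    return names


def missing_from_blacklist(conf_text, watchdog_modules):
    """Return the watchdog modules that conf_text does not blacklist."""
    listed = blacklisted_modules(conf_text)
    return sorted(
        m for m in watchdog_modules if m.replace("-", "_") not in listed
    )


# Hypothetical inputs: a blacklist that lags behind the kernel's module set.
SAMPLE_CONF = """\
# Watchdog drivers should not be loaded automatically, but only if a
# watchdog daemon is installed.
blacklist iTCO_wdt
blacklist softdog
"""
SAMPLE_KERNEL_MODULES = ["hpwdt", "iTCO_wdt", "softdog"]

print(missing_from_blacklist(SAMPLE_CONF, SAMPLE_KERNEL_MODULES))
```

With these sample inputs the only module reported is hpwdt, which is
the kind of gap the fix closes.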

Of course we should still recommend the HW watchdog interface, for
example for 2-node cluster setups where we can't rely on quorum
policies and fencing mechanisms are not available (such as an external
network for powering nodes down).

Regarding the usage of watchdog on top of corosync and
synchronization, yes I agree... this is something I'll pursue.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1432837

Title:
  HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT
  module loaded

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Precise:
  Fix Committed
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Utopic:
  Fix Committed

Bug description:
  Several situations were brought to my attention in which users were
  facing kernel panics while the machine was apparently idle (on some HP
  Proliant servers such as the DL 360 and DL 380).

  ILO:

  "76  Critical  System Error  03/12/2015 12:42  03/12/2015 12:07  2
  An Unrecoverable System Error (NMI) has occurred (System error code
  0x0000002B, 0x00000000)"

  Examples:

  PID: 0      TASK: ffffffff81c1a480  CPU: 0   COMMAND: "swapper/0"
   #0 [ffff88085fc05c88] machine_kexec at ffffffff8104eac2
   #1 [ffff88085fc05cd8] crash_kexec at ffffffff810f26a3
   #2 [ffff88085fc05da0] panic at ffffffff8175b3f2
   #3 [ffff88085fc05e20] sched_clock at ffffffff8101c3b9
   #4 [ffff88085fc05e30] nmi_handle at ffffffff810170e8
   #5 [ffff88085fc05e90] io_check_error at ffffffff8101758e
   #6 [ffff88085fc05eb0] default_do_nmi at ffffffff810176a9
   #7 [ffff88085fc05ed8] do_nmi at ffffffff810177d8
   #8 [ffff88085fc05ef0] end_repeat_nmi at ffffffff8176da21
      [exception RIP: native_safe_halt+6]
      RIP: ffffffff81055186  RSP: ffffffff81c03e90  RFLAGS: 00000246
      RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
      RDX: ffffffff81c03e90  RSI: 0000000000000018  RDI: 0000000000000001
      RBP: ffffffff81055186   R8: ffffffff81055186   R9: 0000000000000018
      R10: ffffffff81c03e90  R11: 0000000000000246  R12: ffffffffffffffff
      R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
  --- <DOUBLEFAULT exception stack> ---
   #9 [ffffffff81c03e90] native_safe_halt at ffffffff81055186
  #10 [ffffffff81c03e98] default_idle at ffffffff8101d37f
  #11 [ffffffff81c03eb8] arch_cpu_idle at ffffffff8101dcaf
  #12 [ffffffff81c03ec8] cpu_startup_entry at ffffffff810b5325
  #13 [ffffffff81c03f40] rest_init at ffffffff81751a37
  #14 [ffffffff81c03f50] start_kernel at ffffffff81d320b7
  #15 [ffffffff81c03f90] x86_64_start_reservations at ffffffff81d315ee
  #16 [ffffffff81c03fa0] x86_64_start_kernel at ffffffff81d31733

  OR

  PID: 0 TASK: ffffffff81c14440 CPU: 0 COMMAND: "swapper/0"
  #0 [ffff880fffa07c40] machine_kexec at ffffffff8104b391
  #1 [ffff880fffa07cb0] crash_kexec at ffffffff810d5fb8
  #2 [ffff880fffa07d80] panic at ffffffff81730335
  #3 [ffff880fffa07e00] hpwdt_pretimeout at ffffffffa02378b5 [hpwdt]
  #4 [ffff880fffa07e20] nmi_handle at ffffffff8174a76a
  #5 [ffff880fffa07ea0] default_do_nmi at ffffffff8174aacd
  #6 [ffff880fffa07ed0] do_nmi at ffffffff8174abe0
  #7 [ffff880fffa07ef0] end_repeat_nmi at ffffffff81749c81
  [exception RIP: intel_idle+204]
  RIP: ffffffff813f07ec RSP: ffffffff81c01d88 RFLAGS: 00000046
  RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000046
  RDX: ffffffff81c01d88 RSI: 0000000000000018 RDI: 0000000000000001
  RBP: ffffffff813f07ec R8: ffffffff813f07ec R9: 0000000000000018
  R10: ffffffff81c01d88 R11: 0000000000000046 R12: ffffffffffffffff
  R13: 0000000001c0d000 R14: ffffffff81c01fd8 R15: 0000000000000000
  ORIG_RAX: 0000000000000000 CS: 0010 SS: 0018
  --- <NMI exception stack> ---
  #8 [ffffffff81c01d88] intel_idle at ffffffff813f07ec
  #9 [ffffffff81c01dc0] cpuidle_enter_state at ffffffff815e76cf
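
In the second trace, frame #3 ends with a bracketed "[hpwdt]", which
appears to mark a symbol that belongs to a loadable module. As a rough
aid for scanning many dumps, the following Python sketch pulls such
frames out of backtrace text. The regular expression is an assumption
fitted to the lines shown above, not a parser for every output format.

```python
import re

# Matches lines such as:
#   #3 [ffff880fffa07e00] hpwdt_pretimeout at ffffffffa02378b5 [hpwdt]
FRAME_RE = re.compile(
    r"^\s*#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<sym>\S+)"
    r"\s+at\s+(?P<addr>[0-9a-f]+)(?:\s+\[(?P<mod>[^\]]+)\])?\s*$"
)


def module_frames(bt_text):
    """Return (frame index, symbol, module) for frames tagged with a module."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m and m.group("mod"):
            frames.append((int(m.group("idx")), m.group("sym"), m.group("mod")))
    return frames


# Three frames copied from the second trace above.
SAMPLE_BT = """\
  #2 [ffff880fffa07d80] panic at ffffffff81730335
  #3 [ffff880fffa07e00] hpwdt_pretimeout at ffffffffa02378b5 [hpwdt]
  #4 [ffff880fffa07e20] nmi_handle at ffffffff8174a76a
"""

print(module_frames(SAMPLE_BT))
```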

  After investigating all the idle situations and several kernel dump
  files, in which most CPUs were either in MWAIT or "relaxing", we
  discovered that HPWDT was loaded and that corosync was opening the
  /dev/watchdog file. This armed the ILO watchdog timer, which corosync
  then did not update as frequently as ILO expected.
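
The failure mode described above can be sketched with a toy model: a
watchdog fires when no keepalive arrives within its timeout. The Python
below is only a simulation using made-up numbers. It does not touch
/dev/watchdog, and the class name, function name, timeout and intervals
are all assumptions for illustration rather than hpwdt or corosync
defaults.

```python
class SimulatedWatchdog:
    """Toy model of a hardware watchdog timer (not the hpwdt driver)."""

    def __init__(self, timeout_s, now=0):
        self.timeout_s = timeout_s
        self.last_pet = now

    def pet(self, now):
        """Keepalive: restart the countdown."""
        self.last_pet = now

    def expired(self, now):
        """True once timeout_s seconds have passed without a keepalive."""
        return now - self.last_pet >= self.timeout_s


def first_expiry(timeout_s, pet_interval_s, horizon_s):
    """Step one second at a time; return when the watchdog fires, or None."""
    wd = SimulatedWatchdog(timeout_s)
    next_pet = pet_interval_s
    for t in range(horizon_s + 1):
        if wd.expired(t):
            return t
        if t >= next_pet:
            wd.pet(t)
            next_pet += pet_interval_s
    return None


# Hypothetical values: a daemon that pets often enough never trips the timer,
# while one that pets too slowly trips it at the timeout.
print(first_expiry(timeout_s=30, pet_interval_s=10, horizon_s=120))
print(first_expiry(timeout_s=30, pet_interval_s=45, horizon_s=120))
```

In the model, the second call corresponds to the situation in this bug:
the timer is armed, the keepalive comes too late, and the expiry is what
the machine sees as an NMI.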

  As described in /etc/modprobe.d/blacklist-watchdog.conf:

  """
  # Watchdog drivers should not be loaded automatically, but only if a
  # watchdog daemon is installed.
  """

  We should blacklist module "hpwdt" by default for all Ubuntu versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837/+subscriptions