kernel-packages team mailing list archive

[Bug 1432837] Re: HP Proliant Servers should not have HPWDT module loaded automatically

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Rafael David Tinoco <inaddy@xxxxxxxxxx>
Date: Mon, 16 Mar 2015 21:10:51 -0000
Reply-to: Bug 1432837 <1432837@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
I developed a small tool based on inotify to help users check whether
their watchdog is being used.

Instructions on how to run it are here:

https://github.com/inaddy/notifymydog

Small Example:

inaddy@host:~$ wget https://raw.githubusercontent.com/inaddy/notifymydog/master/notifymydog.c 
inaddy@host:~/notifymydog$ gcc -Wall -D_DEBUG=0 -D_SYSLOG=1 notifymydog.c -o notifymydog 
inaddy@host:~/notifymydog$ sudo ./notifymydog & 
inaddy@host:~$ sudo tail -f /var/log/syslog 
Mar 16 17:36:26 inaddygueto WATCHMYDOG[15766]: OK: WATCHDOG UPDATED 
Mar 16 17:36:40 inaddygueto WATCHMYDOG[15766]: OK: WATCHDOG UPDATED 
Mar 16 17:36:44 inaddygueto WATCHMYDOG[15766]: WARNING: WATCHDOG WAS CLOSED 
Mar 16 17:36:49 inaddygueto WATCHMYDOG[15766]: WARNING: WATCHDOG WAS OPENED 
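
For readers who want to see the idea without building the C tool, here is a minimal shell sketch of the same kind of monitoring. It assumes the inotifywait command from the inotify-tools package is installed. The function names and the mapping from inotify events to log messages are assumptions made for illustration (modelled on the log lines above), not taken from notifymydog.c.

```shell
#!/bin/sh
# Sketch only: approximates the idea behind notifymydog with inotifywait.
# The event-to-message mapping below is an assumption, not the tool's code.

# Map a comma-separated inotifywait event list (e.g. "CLOSE_WRITE,CLOSE")
# to a log line in the style shown above. Other events print nothing.
classify_event() {
    case "$1" in
        *MODIFY*) echo "OK: WATCHDOG UPDATED" ;;
        *OPEN*)   echo "WARNING: WATCHDOG WAS OPENED" ;;
        *CLOSE*)  echo "WARNING: WATCHDOG WAS CLOSED" ;;
    esac
}

# Watch the device node and send one syslog line per open, close or write.
# Usually needs root; runs until interrupted.
watch_watchdog() {
    ww_dev=${1:-/dev/watchdog}
    inotifywait -q -m -e open -e close -e modify --format '%e' "$ww_dev" |
    while read -r ww_events; do
        ww_msg=$(classify_event "$ww_events")
        if [ -n "$ww_msg" ]; then
            logger -t WATCHMYDOG "$ww_msg"
        fi
    done
}
```

After sourcing the file, running watch_watchdog as root should produce syslog lines similar to the ones above whenever some process touches /dev/watchdog.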

So if you ever get a kernel panic on an HP Proliant DL360 and/or DL380
server for no apparent reason, and the stack trace shows that an NMI was
generated, check whether any of your userland programs has opened
/dev/watchdog, either on purpose (but without updating it frequently
enough) or by accident (causing the watchdog HW to be triggered and the
machine to panic after some time).
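
One way to do that check by hand is to look through /proc for file descriptors that point at the device. The sketch below is only an illustration: the function name fd_holders is made up, it relies on the Linux /proc layout, and it has to run as root to see the descriptors of other users' processes.

```shell
#!/bin/sh
# Sketch only: list the processes that currently hold a given file open.

# Print "PID COMMAND" for every process with an open descriptor on $1
# (default: /dev/watchdog). Processes we may not inspect are skipped.
fd_holders() {
    fh_target=${1:-/dev/watchdog}
    for fh_fd in /proc/[0-9]*/fd/*; do
        # Each entry in /proc/PID/fd is a symlink to the opened file.
        [ "$(readlink "$fh_fd" 2>/dev/null)" = "$fh_target" ] || continue
        fh_pid=${fh_fd#/proc/}
        fh_pid=${fh_pid%%/*}
        printf '%s %s\n' "$fh_pid" "$(cat "/proc/$fh_pid/comm" 2>/dev/null)"
    done | sort -u
}
```

Running fd_holders with no argument as root prints nothing when no process has /dev/watchdog open. Where they are installed, lsof /dev/watchdog or fuser -v /dev/watchdog should give the same answer.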

Workaround:

# echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf 
# update-initramfs -k all -u 
# update-grub
# reboot 
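
After the reboot it is worth confirming that the workaround took effect. The helpers below are a small sketch for that. The function names are hypothetical, and the optional second arguments exist only so the checks can be pointed at test files instead of /etc/modprobe.d and /proc/modules.

```shell
#!/bin/sh
# Sketch only: verify that a module is blacklisted and not loaded.

# Succeed if some *.conf file in the directory $2 (default /etc/modprobe.d)
# contains a "blacklist <module>" line for the module named in $1.
module_blacklisted() {
    mb_module=$1
    mb_dir=${2:-/etc/modprobe.d}
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${mb_module}([[:space:]]|\$)" \
        "$mb_dir"/*.conf
}

# Succeed if the module named in $1 is listed in $2 (default /proc/modules),
# where each line starts with the name of one loaded module.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

For example, `module_blacklisted hpwdt && ! module_loaded hpwdt && echo "hpwdt is blacklisted and not loaded"` should print the message once the workaround is in place.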

** Summary changed:

- HP Proliant Servers should not have HPWDT module loaded automatically
+ HP Proliant Servers - Kernel Panic NMI - DL360 & DL380 - HPWDT module loaded

** Summary changed:

- HP Proliant Servers - Kernel Panic NMI - DL360 & DL380 - HPWDT module loaded
+ HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded

** Description changed:

  Several situations were brought to me where users were facing kernel
- panics when machine was apparently idling:
+ panics when machine was apparently idling (for some HP Proliant Servers
+ like DL 360, DL 380).
  

** Description changed:

  Several situations were brought to me where users were facing kernel
  panics when the machine was apparently idling (on some HP Proliant
  servers such as the DL 360 and DL 380).
+ 
+ ILO:
+ 
+ "76  Critical  System Error  03/12/2015 12:42  03/12/2015 12:07  2
+ An Unrecoverable System Error (NMI) has occurred (System error code
+ 0x0000002B, 0x00000000)"
  

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1432837

Title:
  HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT
  module loaded

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Several situations were brought to me where users were facing
  kernel panics when the machine was apparently idling (on some HP
  Proliant servers such as the DL 360 and DL 380).

  ILO:

  "76  Critical  System Error  03/12/2015 12:42  03/12/2015 12:07  2
  An Unrecoverable System Error (NMI) has occurred (System error code
  0x0000002B, 0x00000000)"

  Examples:

  PID: 0      TASK: ffffffff81c1a480  CPU: 0   COMMAND: "swapper/0"
   #0 [ffff88085fc05c88] machine_kexec at ffffffff8104eac2
   #1 [ffff88085fc05cd8] crash_kexec at ffffffff810f26a3
   #2 [ffff88085fc05da0] panic at ffffffff8175b3f2
   #3 [ffff88085fc05e20] sched_clock at ffffffff8101c3b9
   #4 [ffff88085fc05e30] nmi_handle at ffffffff810170e8
   #5 [ffff88085fc05e90] io_check_error at ffffffff8101758e
   #6 [ffff88085fc05eb0] default_do_nmi at ffffffff810176a9
   #7 [ffff88085fc05ed8] do_nmi at ffffffff810177d8
   #8 [ffff88085fc05ef0] end_repeat_nmi at ffffffff8176da21
      [exception RIP: native_safe_halt+6]
      RIP: ffffffff81055186  RSP: ffffffff81c03e90  RFLAGS: 00000246
      RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
      RDX: ffffffff81c03e90  RSI: 0000000000000018  RDI: 0000000000000001
      RBP: ffffffff81055186   R8: ffffffff81055186   R9: 0000000000000018
      R10: ffffffff81c03e90  R11: 0000000000000246  R12: ffffffffffffffff
      R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
  --- <DOUBLEFAULT exception stack> ---
   #9 [ffffffff81c03e90] native_safe_halt at ffffffff81055186
  #10 [ffffffff81c03e98] default_idle at ffffffff8101d37f
  #11 [ffffffff81c03eb8] arch_cpu_idle at ffffffff8101dcaf
  #12 [ffffffff81c03ec8] cpu_startup_entry at ffffffff810b5325
  #13 [ffffffff81c03f40] rest_init at ffffffff81751a37
  #14 [ffffffff81c03f50] start_kernel at ffffffff81d320b7
  #15 [ffffffff81c03f90] x86_64_start_reservations at ffffffff81d315ee
  #16 [ffffffff81c03fa0] x86_64_start_kernel at ffffffff81d31733

  OR

  PID: 0 TASK: ffffffff81c14440 CPU: 0 COMMAND: "swapper/0"
  #0 [ffff880fffa07c40] machine_kexec at ffffffff8104b391
  #1 [ffff880fffa07cb0] crash_kexec at ffffffff810d5fb8
  #2 [ffff880fffa07d80] panic at ffffffff81730335
  #3 [ffff880fffa07e00] hpwdt_pretimeout at ffffffffa02378b5 [hpwdt]
  #4 [ffff880fffa07e20] nmi_handle at ffffffff8174a76a
  #5 [ffff880fffa07ea0] default_do_nmi at ffffffff8174aacd
  #6 [ffff880fffa07ed0] do_nmi at ffffffff8174abe0
  #7 [ffff880fffa07ef0] end_repeat_nmi at ffffffff81749c81
  [exception RIP: intel_idle+204]
  RIP: ffffffff813f07ec RSP: ffffffff81c01d88 RFLAGS: 00000046
  RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000046
  RDX: ffffffff81c01d88 RSI: 0000000000000018 RDI: 0000000000000001
  RBP: ffffffff813f07ec R8: ffffffff813f07ec R9: 0000000000000018
  R10: ffffffff81c01d88 R11: 0000000000000046 R12: ffffffffffffffff
  R13: 0000000001c0d000 R14: ffffffff81c01fd8 R15: 0000000000000000
  ORIG_RAX: 0000000000000000 CS: 0010 SS: 0018
  --- <NMI exception stack> ---
  #8 [ffffffff81c01d88] intel_idle at ffffffff813f07ec
  #9 [ffffffff81c01dc0] cpuidle_enter_state at ffffffff815e76cf

  After investigating all the idling situations and various kernel dump
  files, in which most of the CPUs were either MWAITing or "relaxing",
  we discovered that HPWDT was loaded and corosync was opening the
  /dev/watchdog file. This triggered the ILO watchdog timer, and
  corosync was not updating it as frequently as ILO expected.

  As described in /etc/modprobe.d/blacklist-watchdog.conf:

  """
  # Watchdog drivers should not be loaded automatically, but only if a
  # watchdog daemon is installed.
  """

  We should blacklist module "hpwdt" by default for all Ubuntu versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837/+subscriptions