← Back to team overview

kernel-packages team mailing list archive

[Bug 1546442] Re: ISST-LTE: Ubuntu 16.04 LPAR has Kernel panic when running base, IO, NFS, TCP tests together

 

** Package changed: ubuntu => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1546442

Title:
  ISST-LTE: Ubuntu 16.04 LPAR has Kernel panic when running base, IO,
  NFS, TCP tests together

Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - YUECHANG E. MEI  - 2016-02-10 16:58:26 ==
  ---Problem Description---
  Our LPAR, conelp1, has Ubuntu 16.04 installed and is using Houston adapter for network and SAN disks. 

  When we ran base, IO, and TCP tests on conelp1, it would stop the
  tests by itself or hung after the test ran for 10+ hrs.

  Then, I tried to increase min_free_kbytes and started base, IO, NFS,
  and TCP tests on it.  Later,  it has Kernel panic after it ran the ST
  tests for an hour.

  root@conelp1:~# echo 365536 > /proc/sys/vm/min_free_kbytes
  root@conelp1:~# cat /proc/sys/vm/min_free_kbytes
  365536

   root@conelp1:~# [ 1682.274535] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
  [ 1682.274535]
  [ 1682.274567] CPU: 11 PID: 1 Comm: systemd Not tainted 4.4.0-2-generic #16-Ubuntu
  [ 1682.274573] Call Trace:
  [ 1682.274583] [c00000017ad83a60] [c000000000ad5ec0] dump_stack+0x90/0xbc (unreliable)
  [ 1682.274593] [c00000017ad83a90] [c000000000ad2140] panic+0x100/0x2c8
  [ 1682.274601] [c00000017ad83b20] [c0000000000bce18] do_exit+0xbe8/0xbf0
  [ 1682.274608] [c00000017ad83be0] [c0000000000bcf04] do_group_exit+0x64/0x100
  [ 1682.274616] [c00000017ad83c20] [c0000000000ce23c] get_signal+0x52c/0x770
  [ 1682.274626] [c00000017ad83d10] [c0000000000173d4] do_signal+0x54/0x2b0
  [ 1682.274633] [c00000017ad83e00] [c00000000001782c] do_notify_resume+0xbc/0xd0
  [ 1682.274641] [c00000017ad83e30] [c000000000009838] ret_from_except_lite+0x64/0x68
  [ 1682.284440] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
  [ 1682.284440]

  I tried to collect the dump when it crashed at the first time, but failed. conelp1 is in xmon now.
   

  ---uname output---
  4.4.0-2-generic 
   
  Machine Type = EUH Alpine 8408-E8E 
   
  ---Debugger Data---

  16:mon> r
  R00 = c0000000000439a8   R16 = c000000000f69c28
  R01 = c00000017ff27cf0   R17 = 0000000000000001
  R02 = c000000001583800   R18 = c00000017adfc000
  R03 = 1600000000000000   R19 = 0000000000000008
  R04 = 0000000000000000   R20 = c00000017adfc080
  R05 = 0000000000000000   R21 = 0000000000000001
  R06 = 0000000000000000   R22 = 0000000000000002
  R07 = 0000000000000000   R23 = 0000000000000010
  R08 = c0000000015beaa8   R24 = c00000017e2c0800
  R09 = 0000000000000000   R25 = 0000000000000000
  R10 = 0000000000000000   R26 = 0000000000000000
  R11 = 0000000000000002   R27 = 0000000000000001
  R12 = c000000000043980   R28 = 0000000000000000
  R13 = c000000007afd100   R29 = c000000000043980
  R14 = c000000000b03a18   R30 = 0000000000000000
  R15 = c000000000f7d670   R31 = 0000000000000000
  pc  = b8390400000000c0
  cfar= 000000000000011c
  lr  = c0000000000439a8 stop_this_cpu+0x28/0x40
  msr = 1000000000000080   cr  = 28002882
  ctr = c000000000043980   xer = 0000000000000000   trap =  100
  16:mon> e
  cpu 0x16: Vector: 100 (System Reset) at [c00000017ff27a70]
      pc: b8390400000000c0
      lr: c0000000000439a8: stop_this_cpu+0x28/0x40
      sp: c00000017ff27cf0
     msr: 1000000000000080
    current = 0xc00000017ad69370
    paca    = 0xc000000007afd100   softe: 0        irq_happened: 0x0d
      pid   = 0, comm = swapper/22
  16:mon> d
  0000000000000000 **************** ****************  |                |
  16:mon> t
  [c00000017ff27cf0] c0000000000439a8 stop_this_cpu+0x28/0x40 (unreliable)
  [c00000017ff27d10] c000000000168510 flush_smp_call_function_queue+0x120/0x1d0
  [c00000017ff27d90] c000000000044568 smp_ipi_demux+0x98/0x100
  [c00000017ff27dd0] c00000000006c494 icp_hv_ipi_action+0x64/0xd0
  [c00000017ff27e40] c00000000012f370 handle_irq_event_percpu+0x90/0x2b0
  [c00000017ff27f00] c0000000001351e4 handle_percpu_irq+0x84/0xd0
  [c00000017ff27f30] c00000000012e564 generic_handle_irq+0x54/0x80
  [c00000017ff27f60] c000000000011300 __do_irq+0x80/0x190
  [c00000017ff27f90] c000000000024760 call_do_irq+0x14/0x24
  [c00000017adff9f0] c0000000000114a8 do_IRQ+0x98/0x140
  [c00000017adffa40] c000000000002594 hardware_interrupt_common+0x114/0x180
  --- Exception: 501 (Hardware Interrupt) at c000000000085d5c plpar_hcall_norets+0x1c/0x28
  [link register   ] c0000000008f61e4 check_and_cede_processor+0x34/0x50
  [c00000017adffd30] c0000000008f61d0 check_and_cede_processor+0x20/0x50 (unreliable)
  [c00000017adffd90] c0000000008f6274 dedicated_cede_loop+0x74/0x190
  [c00000017adffdd0] c0000000008f3460 cpuidle_enter_state+0x160/0x3c0
  [c00000017adffe30] c000000000118c18 call_cpuidle+0x78/0xd0
  [c00000017adffe70] c000000000118fac cpu_startup_entry+0x33c/0x450
  [c00000017adfff30] c00000000004559c start_secondary+0x33c/0x360
  [c00000017adfff90] c000000000008b6c start_secondary_prolog+0x10/0x14

  
   
  ---Steps to Reproduce---
   1. Install Ubuntu 16.04 in LPAR (using SAN disks) with multipath disable (because of bug 136777)
  2. mount kte folder, and setup ST (base, IO, NFS, and TCP) tests
  3. increase the value of min_free_kbytes to  365536 (echo 365536 > /proc/sys/vm/min_free_kbytes)
  4. start ST  (base, IO, NFS, and TCP) tests
  5. leave the console open, and the LPAR will crash after an hour run
   
  Stack trace output:
  no

  
   
  Oops output:
   no
   
  System Dump Info:
    The system was configured to capture a dump, however a dump was not produced.
   

  == Comment: #3 - PAWAN K. SINGH - 2016-02-11 09:47:09 ==
  As per the log description it looks like energy management  issue like:-

  [c00000017adffa40] c000000000002594 hardware_interrupt_common+0x114/0x180
  --- Exception: 501 (Hardware Interrupt) at c000000000085d5c plpar_hcall_norets+0x1c/0x28
  [link register   ] c0000000008f61e4 check_and_cede_processor+0x34/0x50
  [c00000017adffd30] c0000000008f61d0 check_and_cede_processor+0x20/0x50 (unreliable)

  adding Shreyas  and Shilpa for further assistance

  Thanks ,

  
  == Comment: #6 - YUECHANG E. MEI  - 2016-02-12 18:25:54 ==

  We are still in the middle of recreating the issue. However, the ST
  tests running on LPARs (conelp1 and conelp2) were stopped by the
  LPARs, and same messages were printed before tests stopped. I am not
  sure if this related to the kernel panic problem, please advise if we
  need to open a new bug. Thank you!

  ....
  [Fri Feb 12 16:59:03 2016] systemd[1]: Starting NFS server and services...
  [Fri Feb 12 16:59:03 2016] systemd[1]: Started D-Bus System Message Bus.
  [Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.PolicyKit1': Device or resource busy
  [Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.timedate1': Device or resource busy
  [Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.locale1': Device or resource busy
  [Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
  [Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
  [Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.hostname1': Device or resource busy

  (to see the detail of the message, please read the attached file)

  == Comment: #7 - Manjunatha H R  - 2016-02-15 01:38:17 ==
  Same issue is seen on roselp1, lpar crashed on fell into XMON.

  d:mon> t
  [link register   ] c000000000ad2300 panic+0x2c0/0x2c8
  [c000000276003a90] c000000000ad22b8 panic+0x278/0x2c8 (unreliable)
  [c000000276003b20] c0000000000bce18 do_exit+0xbe8/0xbf0
  [c000000276003be0] c0000000000bcf04 do_group_exit+0x64/0x100
  [c000000276003c20] c0000000000ce23c get_signal+0x52c/0x770
  [c000000276003d10] c0000000000173d4 do_signal+0x54/0x2b0
  [c000000276003e00] c00000000001782c do_notify_resume+0xbc/0xd0
  [c000000276003e30] c000000000009838 ret_from_except_lite+0x64/0x68
  --- Exception: 0  at 0101010101010100

  d:mon> e
  cpu 0xd: Vector: 100 (System Reset) at [c000000276003810]
      pc: 04f00100000000c0
      lr: c000000000ad2300: panic+0x2c0/0x2c8
      sp: c000000276003a90
     msr: 1000000200000080
    current = 0xc000000277f40000
    paca    = 0xc000000007af7b80   softe: 0        irq_happened: 0x01
      pid   = 1, comm = systemd
  d:mon>

  
  == Comment: #10 - Shreyas B. Prabhu  - 2016-02-15 02:11:26 ==
  (In reply to comment #3)
  > As per the log description it looks like energy management  issue like:-
  > 
  > [c00000017adffa40] c000000000002594 hardware_interrupt_common+0x114/0x180
  > --- Exception: 501 (Hardware Interrupt) at c000000000085d5c
  > plpar_hcall_norets+0x1c/0x28
  > [link register   ] c0000000008f61e4 check_and_cede_processor+0x34/0x50
  > [c00000017adffd30] c0000000008f61d0 check_and_cede_processor+0x20/0x50
  > (unreliable)
  > 
  > adding Shreyas  and Shilpa for further assistance
  > 
  > Thanks ,

  From the logs reported by Yuechang E. Mei and Manjunatha, the crash seems to be due to unhandled exceptions in systemd which lead to systemd crash (Note - "Attempted to kill init!" in the logs ).  Since the lpars were configured enter xmon upon crash, all the other cpu's would have got interrupts to bring them to xmon. Any cpu which was idle when it got this interrupt will have cpuidle functions in its call stack. 
  So its unlikely that this bug is related to cpuidle and is more likely a systemd bug.

  
  == Comment: #11 - Manjunatha H R  - 2016-02-15 02:43:51 ==
  Looks like this issue needs to be addressed by systemd expert :  Lpars which hit this bug has shown  : Crash OR Tests abort (all systemd services restarts)

  1. In case of crash: traces shows - 
  ---------------------------------
  [ 1682.274567] CPU: 11 PID: 1 Comm: systemd Not tainted 4.4.0-2-generic #16-Ubuntu
  [ 1682.274573] Call Trace:

  
  2. In case of Tests abort: dmesg shows - 
  -------------------------
  [220784.781637] systemd[1]: systemd-journald.service: Failed with result 'signal'.
  [220784.791216] systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
  [220784.791684] systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
  [220784.791730] systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.
  [220784.791856] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
  [220784.792159] systemd[1]: systemd-udevd.service: Unit entered failed state.
  [220784.792181] systemd[1]: systemd-udevd.service: Failed with result 'signal'.
  [220784.792275] systemd[1]: accounts-daemon.service: Main process exited, code=killed, status=9/KILL
  [220784.792610] systemd[1]: accounts-daemon.service: Unit entered failed state.
  [220784.792624] systemd[1]: accounts-daemon.service: Failed with result 'signal'.
  [220784.792750] systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
  [220784.830737] systemd[1]: systemd-logind.service: Main process exited, code=killed, status=9/KILL
  [220784.831167] systemd[1]: systemd-logind.service: Unit entered failed state.
  [220784.831223] systemd[1]: systemd-logind.service: Failed with result 'signal'.
  [220784.831342] systemd[1]: iprdump.service: Main process exited, code=killed, status=9/KILL
  [220784.831749] systemd[1]: iprdump.service: Unit entered failed state.
  [220784.831774] systemd[1]: iprdump.service: Failed with result 'signal'.
  [220784.831863] systemd[1]: cron.service: Main process exited, code=killed, status=9/KILL
  [220784.832216] systemd[1]: cron.service: Unit entered failed state.
  [220784.832230] systemd[1]: cron.service: Failed with result 'signal'.
  [220784.832305] systemd[1]: rtas_errd.service: Main process exited, code=killed, status=9/KILL
  [220784.832759] systemd[1]: rtas_errd.service: Unit entered failed state.
  [220784.832797] systemd[1]: rtas_errd.service: Failed with result 'signal'.
  [220784.832910] systemd[1]: polkitd.service: Main process exited, code=killed, status=9/KILL
  [220784.833269] systemd[1]: polkitd.service: Unit entered failed state.
  [220784.833284] systemd[1]: polkitd.service: Failed with result 'signal'.
  [220784.833400] systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
  [220784.833660] systemd[1]: ssh.service: Unit entered failed state.
  [220784.833673] systemd[1]: ssh.service: Failed with result 'signal'.
  [220784.833757] systemd[1]: iprupdate.service: Main process exited, code=killed, status=9/KILL
  [220784.834088] systemd[1]: iprupdate.service: Unit entered failed state.
  [220784.834102] systemd[1]: iprupdate.service: Failed with result 'signal'.
  [220784.834173] systemd[1]: iprinit.service: Main process exited, code=killed, status=9/KILL
  [220784.834503] systemd[1]: iprinit.service: Unit entered failed state.
  [220784.834516] systemd[1]: iprinit.service: Failed with result 'signal'.
  [220784.834647] systemd[1]: postgresql@9.5-main.service: Main process exited, code=killed, status=9/KILL
  [220784.842900] systemd[1]: staf.service: Main process exited, code=killed, status=9/KILL
  [220784.844557] systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
  [220784.844903] systemd[1]: rsyslog.service: Unit entered failed state.
  [220784.844930] systemd[1]: rsyslog.service: Failed with result 'signal'.
  [220784.852532] systemd[1]: dbus.service: Unit entered failed state.
  [220784.852565] systemd[1]: dbus.service: Failed with result 'signal'.
  [220784.852987] systemd[1]: staf.service: Unit entered failed state.
  [220784.853005] systemd[1]: staf.service: Failed with result 'signal'.
  [220784.855176] systemd[1]: getty@tty1.service: Service has no hold-off time, scheduling restart.
  [220784.855473] systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
  [220784.855504] systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
  [220784.856557] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
  [220784.893452] systemd[1]: lvm2-lvmetad.service: Service hold-off time over, scheduling restart.
  [220784.932383] systemd[1]: Stopped LVM2 metadata daemon.
  [220784.945979] systemd[1]: Started LVM2 metadata daemon.
  [220784.946532] systemd[1]: Stopped udev Kernel Device Manager.
  [220784.948518] systemd[1]: Starting udev Kernel Device Manager...
  [220784.948724] systemd[1]: Stopped Login Service.
  [220784.951208] systemd[1]: Starting Login Service...
  [220784.951649] systemd[1]: Stopped Flush Journal to Persistent Storage.
  [220784.951690] systemd[1]: Stopping Flush Journal to Persistent Storage...
  [220784.951872] systemd[1]: Stopped Journal Service.
  [220784.952634] systemd[1]: Starting Journal Service...
  [220784.952734] systemd[1]: Stopped Getty on tty1.
  [220784.954067] systemd[1]: Started Getty on tty1.
  [220784.960680] systemd[1]: rsyslog.service: Service hold-off time over, scheduling restart.
  [220784.960951] systemd[1]: serial-getty@hvc0.service: Service hold-off time over, scheduling restart.
  [220784.961186] systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
  [220784.972830] systemd-journald[162446]: File /run/log/journal/01ce52241c3846edade41f23178de962/system.journal corrupted or uncleanly shut down, renaming and replacing.
  [220784.975464] systemd[1]: Started D-Bus System Message Bus.
  [220784.980448] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
  [220784.980457] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
  [220784.980463] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.PolicyKit1': Device or resource busy
  [220784.980470] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.timedate1': Device or resource busy
  [220784.980476] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.locale1': Device or resource busy
  [220784.980482] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.hostname1': Device or resource busy
  [220785.011703] systemd[1]: Stopped OpenBSD Secure Shell server.
  [220785.013617] systemd[1]: Starting OpenBSD Secure Shell server...
  [220785.013748] systemd[1]: Stopped Serial Getty on hvc0.
  [220785.014853] systemd[1]: Started Serial Getty on hvc0.
  [220785.015193] systemd[1]: Stopped System Logging Service.
  [220785.020295] systemd[1]: Starting System Logging Service...
  [220785.020555] systemd[1]: Started Journal Service.


  Lpars affected:
  --------------------
  1. Crashed lpars so far : conelp1, roselp1, pinelp3

  2. Tests aborted lpars so far : conelp2, roselp2, pinelp3 and conelp1

  roslep1 is left for debugging this issue. Also please let us if a
  separate bug required to handle crash and test abort cases separately.

  Thanks,
  Manju

  == Comment: #13 - Ping Tian Han 2016-02-16 01:48:37 ==
  pinelp3 hitted this problem again with 4.4.0-4-generic. systemd killed and all ST stopped.

  == Comment: #16 - Ping Tian Han - 2016-02-17 02:03:57 ==
  systemd killed after running stress tests for about 5 hours on pinelp1. Kernel version: 4.4.0-4-generic.

  == Comment: #17 - PAWAN K. SINGH - 2016-02-17 02:15:08 ==
  form the previous comments it looks like a systemd issue thus we are mirroring so that distro is aware of it

  Thanks,

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1546442/+subscriptions