
kernel-packages team mailing list archive

[Bug 1339199] [NEW] mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE

 

Public bug reported:

mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE

^^^^^^^^
SUMMARY:
^^^^^^^^
Launching the MPI user job getmem.64 crashes an Ubuntu 14.04 node.
Node c656f2n05 is still held in the debugger with its console open,
and we can give access when needed.

^^^^^^^^^^^^^^
CONFIGURATION:
^^^^^^^^^^^^^^
Job launched across a cluster of 4 P8 22L systems (hostnames on the IBM 9.* network):
c656f2n03 is c656f2n03.pok.stglabs.ibm.com is 9.114.39.143
c656f2n04 is c656f2n04.pok.stglabs.ibm.com is 9.114.39.144
c656f2n05 is c656f2n05.pok.stglabs.ibm.com is 9.114.39.145 <--- node crashed
c656f2n06 is c656f2n06.pok.stglabs.ibm.com is 9.114.39.146

^^^^^^
BUILD:
^^^^^^
Ubuntu 14.04 LTS c656f2n05 hvc0
Linux c656f2n05 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:46:02 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux

^^^^^^^^^
SCENARIO:
^^^^^^^^^
Hi, all:

P8 server c656f2n05 crashed again.

I launched the OpenSHMEM regression with the efix for D198270 before the
server crashed.

Here are the steps I used to launch the test:
---------------------
1. ssh c656f2n03 with root/davega.

2. su - qixiaol

3. cd /u/qixiaol/fvt/openshmem/get

4. [c656f2n03][/u/qixiaol/fvt/openshmem/get]> env | grep MP
MP_ADAPTER_USE=shared
MP_EUIDEVICE=sn_all
MP_EUILIB=us
MP_EUILIBPATH=/u/qixiaol/fvt/efix/RVIN_LE/
MP_HOSTFILE=/u/qixiaol/fvt/openshmem/host.list
MP_PROCS=8
MP_RESD=poe

5. [c656f2n03][/u/qixiaol/fvt/openshmem/get]> ./run.all
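
Before re-running step 5 it may help to confirm that the shell still carries the MP_* settings listed in step 4. The following is only a minimal pre-flight sketch in Python: the expected values are copied from the `env | grep MP` output above, while the helper name `mp_env_mismatches` is hypothetical and not part of the test suite.

```python
import os

# Values copied from the `env | grep MP` output in step 4 of the report.
EXPECTED_MP_ENV = {
    "MP_ADAPTER_USE": "shared",
    "MP_EUIDEVICE": "sn_all",
    "MP_EUILIB": "us",
    "MP_EUILIBPATH": "/u/qixiaol/fvt/efix/RVIN_LE/",
    "MP_HOSTFILE": "/u/qixiaol/fvt/openshmem/host.list",
    "MP_PROCS": "8",
    "MP_RESD": "poe",
}


def mp_env_mismatches(env=None):
    """Return {name: (expected, actual)} for each MP_* setting that differs.

    `actual` is None when the variable is not set at all.
    """
    env = os.environ if env is None else env
    return {
        name: (expected, env.get(name))
        for name, expected in EXPECTED_MP_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    # Print one line per setting that does not match the reported environment.
    for name, (expected, actual) in sorted(mp_env_mismatches().items()):
        print(f"{name}: expected {expected!r}, found {actual!r}")
```

An empty result means the environment matches the one under which the crash was reported.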


And the test log is at /u/qixiaol/fvt/openshmem/get/RESULTS:
-------------------------------
[c656f2n03][/u/qixiaol/fvt/openshmem/get/RESULTS]>

The binary file of getmem.64 is at
c656f2n03:/u/qixiaol/fvt/openshmem/get/bin/ppc64le/getmem.64

And its source code is at
c656f2n03:/u/qixiaol/fvt/openshmem/get/src/getmem.c

Best Wishes

Xiao Lu Qi

^^^^^^^^
SYMPTOM:
^^^^^^^^
Ubuntu 14.04 LTS c656f2n05 hvc0

c656f2n05 login: [418442.082487]  rport-1:0-3: blocked FC remote port time out: removing rport
[418442.082492]  rport-5:0-6: blocked FC remote port time out: removing rport
[418442.082494]  rport-2:0-3: blocked FC remote port time out: removing rport
[466884.681785] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
    pc: c00000000022fa34: .kfree+0x124/0x220
[466884.683019] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
    lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
    sp: c000001f39a13590
   msr: 9000000000029033
  current = 0xc000001f58828bf0
  paca    = 0xc00000000fe45780   softe: 0        irq_happened: 0x01
    pid   = 98622, comm = getmem.64
[466884.685546] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
[466884.687415] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
    pc: c00000000022fa34: .kfree+0x124/0x220
    lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
    sp: c000001f39a6b590
   msr: 9000000000029033
  current = 0xc000001f5886f840
  paca    = 0xc00000000fe43b80   softe: 0        irq_happened: 0x01
    pid   = 98621, comm = getmem.64
kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
enter ? for help
19:mon> 
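
Both register dumps above fault at the same place (.kfree+0x124, called from .skb_free_head+0x78) in two different getmem.64 processes. For comparing several dumps, a small parser such as the sketch below can pull those fields out. It is written in Python, its regular expressions assume exactly the xmon layout quoted in this report, and the function name `parse_xmon_dumps` is hypothetical.

```python
import re

# Patterns follow the layout of the xmon output quoted in this report;
# other kernels or xmon versions may format these lines differently.
CPU_RE = re.compile(r"cpu (0x[0-9a-f]+): Vector: ([0-9a-f]+) \(([^)]+)\)")
REG_RE = re.compile(
    r"\s*(pc|lr): ([0-9a-f]{16}): \.?([\w.]+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)"
)
PID_RE = re.compile(r"pid\s*=\s*(\d+), comm = (\S+)")

# The second dump from the report, which arrived without interleaving.
SAMPLE = """\
cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
    pc: c00000000022fa34: .kfree+0x124/0x220
    lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
    sp: c000001f39a6b590
   msr: 9000000000029033
    pid   = 98621, comm = getmem.64
"""


def parse_xmon_dumps(text):
    """Return one dict per 'cpu 0x..: Vector:' block found in text."""
    dumps = []
    for line in text.splitlines():
        m = CPU_RE.search(line)
        if m:
            dumps.append({
                "cpu": int(m.group(1), 16),
                "vector": m.group(2),
                "reason": m.group(3),
            })
            continue
        if not dumps:
            continue  # ignore anything before the first cpu header
        m = REG_RE.match(line)
        if m:
            # Store (symbol, offset into symbol) under "pc" or "lr".
            dumps[-1][m.group(1)] = (m.group(3), int(m.group(4), 16))
            continue
        m = PID_RE.search(line)
        if m:
            dumps[-1]["pid"] = int(m.group(1))
            dumps[-1]["comm"] = m.group(2)
    return dumps


if __name__ == "__main__":
    for dump in parse_xmon_dumps(SAMPLE):
        print(dump)
```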

  t     print backtrace

19:mon> t
[c000001f39a13620] c0000000007fb6a8 .skb_free_head+0x78/0xb0
[c000001f39a136a0] c0000000007fb904 .__kfree_skb+0x24/0x40
[c000001f39a13720] c00000000080396c .skb_free_datagram_locked+0xbc/0x140
[c000001f39a137b0] c0000000008992c0 .udp_recvmsg+0x1b0/0x530
[c000001f39a13890] c0000000008a6e70 .inet_recvmsg+0xa0/0x100
[c000001f39a13940] c0000000007ee7d8 .sock_recvmsg+0x108/0x160
[c000001f39a13ab0] c0000000007ef700 .___sys_recvmsg+0x150/0x320
[c000001f39a13c90] c0000000007f2858 .__sys_recvmsg+0x58/0xc0
[c000001f39a13d70] c0000000007f30ec .SyS_socketcall+0x38c/0x3f0
[c000001f39a13e30] c00000000000a158 syscall_exit+0x0/0x98
--- Exception: c01 (System Call) at 00003ffdbf7c10bc
SP (3fffc777abc0) is in userspace
19:mon> 
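
The backtrace shows the BUG being hit while the kernel frees a socket buffer at the end of a recvmsg() call on a UDP socket. The Python sketch below exercises the same user-space entry point (recvmsg on an AF_INET datagram socket) over loopback. It is not a reproducer and is not taken from getmem.c; it only illustrates which ordinary socket call corresponds to the trace. The function name `udp_recvmsg_roundtrip` is hypothetical.

```python
import socket


def udp_recvmsg_roundtrip(payload=b"ping", bufsize=2048):
    """Send one datagram over loopback and read it back with recvmsg().

    On the kernel side, a UDP recvmsg() copies the datagram to user space
    and then frees the socket buffer, which is the step where the trace
    in this report ends up in kfree().
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))   # let the kernel pick a free port
        rx.settimeout(5.0)          # do not hang if the datagram is lost
        tx.sendto(payload, rx.getsockname())
        data, ancdata, flags, addr = rx.recvmsg(bufsize)
        return data


if __name__ == "__main__":
    print(udp_recvmsg_roundtrip())
```

On a healthy kernel this returns the payload unchanged, which supports the point that the failure is in the kernel free path rather than in anything unusual about the system call itself.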

^^^^^^
DEBUG:
^^^^^^
Hi Xiao Lu,

A user space job should not crash the node.  If the process uses too
much memory, then the process itself should get ENOMEM and the node
should not crash.  Could you please check the system error log to see
what the real issue was that caused the node to crash?
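
As an illustration of the expected behaviour, the Python sketch below asks the C library for an allocation far larger than the address space and checks that the failure is reported to the calling process as ENOMEM. It assumes a 64-bit Linux system where `ctypes.CDLL(None)` exposes libc's malloc; the request size and the function name `try_huge_malloc` are arbitrary choices for this example.

```python
import ctypes
import errno


def try_huge_malloc(nbytes=1 << 62):
    """Attempt an oversized malloc(); return (succeeded, errno_value)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.restype = None
    libc.free.argtypes = [ctypes.c_void_p]

    ctypes.set_errno(0)
    ptr = libc.malloc(nbytes)      # a NULL pointer comes back as None
    err = ctypes.get_errno()
    if ptr is not None:
        libc.free(ptr)
        return True, 0
    return False, err


if __name__ == "__main__":
    ok, err = try_huge_malloc()
    if not ok:
        print("malloc failed:", errno.errorcode.get(err, err))
```

The process sees an error code and keeps running, and the node is unaffected, which is the behaviour the crash in this report violates.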

From the information Dave provided, it looks like we accidentally
triggered a bug in the kernel.  I believe we need someone from LTC to check it.

[466884.681785] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
    pc: c00000000022fa34: .kfree+0x124/0x220
[466884.683019] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
    lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
    sp: c000001f39a13590
   msr: 9000000000029033
  current = 0xc000001f58828bf0
  paca    = 0xc00000000fe45780   softe: 0        irq_happened: 0x01
    pid   = 98622, comm = getmem.64
[466884.685546] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
[466884.687415] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
    pc: c00000000022fa34: .kfree+0x124/0x220
    lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
    sp: c000001f39a6b590
   msr: 9000000000029033
  current = 0xc000001f5886f840
  paca    = 0xc00000000fe43b80   softe: 0        irq_happened: 0x01
    pid   = 98621, comm = getmem.64
kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
enter ? for help
19:mon>

** Affects: makedumpfile (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1339199

Title:
  mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE

Status in “makedumpfile” package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/makedumpfile/+bug/1339199/+subscriptions

