kernel-packages team mailing list archive
Message #70313
[Bug 1339199] [NEW] mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE
Public bug reported:
mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE
^^^^^^^^
SUMMARY:
^^^^^^^^
Launching the MPI user job getmem.64 crashes Ubuntu 14.04.
The node c656f2n05 is still in the debugger with its console open,
and we can give access when needed.
^^^^^^^^^^^^^^
CONFIGURATION:
^^^^^^^^^^^^^^
Job launched across a cluster of 4 P8 22L systems (hostnames on the IBM 9.* net)
c656f2n03 is c656f2n03.pok.stglabs.ibm.com is 9.114.39.143
c656f2n04 is c656f2n04.pok.stglabs.ibm.com is 9.114.39.144
c656f2n05 is c656f2n05.pok.stglabs.ibm.com is 9.114.39.145 <--- node crashed
c656f2n06 is c656f2n06.pok.stglabs.ibm.com is 9.114.39.146
^^^^^^
BUILD:
^^^^^^
Ubuntu 14.04 LTS c656f2n05 hvc0
Linux c656f2n05 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:46:02 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
^^^^^^^^^
SCENARIO:
^^^^^^^^^
Hi, all:
P8 server c656f2n05 crashed again.
I launched the openshmem regression with the efix of D198270 before the server
crashed.
Below are the steps I used to launch the test:
---------------------
1. ssh c656f2n03 with root/davega.
2. su - qixiaol
3. cd /u/qixiaol/fvt/openshmem/get
4. [c656f2n03][/u/qixiaol/fvt/openshmem/get]> env | grep MP
MP_ADAPTER_USE=shared
MP_EUIDEVICE=sn_all
MP_EUILIB=us
MP_EUILIBPATH=/u/qixiaol/fvt/efix/RVIN_LE/
MP_HOSTFILE=/u/qixiaol/fvt/openshmem/host.list
MP_PROCS=8
MP_RESD=poe
5. [c656f2n03][/u/qixiaol/fvt/openshmem/get]> ./run.all
And the test log is at /u/qixiaol/fvt/openshmem/get/RESULTS:
-------------------------------
[c656f2n03][/u/qixiaol/fvt/openshmem/get/RESULTS]>
The binary file of getmem.64 is at
c656f2n03:/u/qixiaol/fvt/openshmem/get/bin/ppc64le/getmem.64
And its source code is at
c656f2n03:/u/qixiaol/fvt/openshmem/get/src/getmem.c
Best Wishes
Xiao Lu Qi
^^^^^^^^
SYMPTOM:
^^^^^^^^
Ubuntu 14.04 LTS c656f2n05 hvc0
c656f2n05 login: [418442.082487] rport-1:0-3: blocked FC remote port time out: removing rport
[418442.082492] rport-5:0-6: blocked FC remote port time out: removing rport
[418442.082494] rport-2:0-3: blocked FC remote port time out: removing rport
[466884.681785] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
pc: c00000000022fa34: .kfree+0x124/0x220
[466884.683019] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
sp: c000001f39a13590
msr: 9000000000029033
current = 0xc000001f58828bf0
paca = 0xc00000000fe45780 softe: 0 irq_happened: 0x01
pid = 98622, comm = getmem.64
[466884.685546] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
[466884.687415] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
pc: c00000000022fa34: .kfree+0x124/0x220
lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
sp: c000001f39a6b590
msr: 9000000000029033
current = 0xc000001f5886f840
paca = 0xc00000000fe43b80 softe: 0 irq_happened: 0x01
pid = 98621, comm = getmem.64
kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
enter ? for help
19:mon>
t print backtrace
19:mon> t
[c000001f39a13620] c0000000007fb6a8 .skb_free_head+0x78/0xb0
[c000001f39a136a0] c0000000007fb904 .__kfree_skb+0x24/0x40
[c000001f39a13720] c00000000080396c .skb_free_datagram_locked+0xbc/0x140
[c000001f39a137b0] c0000000008992c0 .udp_recvmsg+0x1b0/0x530
[c000001f39a13890] c0000000008a6e70 .inet_recvmsg+0xa0/0x100
[c000001f39a13940] c0000000007ee7d8 .sock_recvmsg+0x108/0x160
[c000001f39a13ab0] c0000000007ef700 .___sys_recvmsg+0x150/0x320
[c000001f39a13c90] c0000000007f2858 .__sys_recvmsg+0x58/0xc0
[c000001f39a13d70] c0000000007f30ec .SyS_socketcall+0x38c/0x3f0
[c000001f39a13e30] c00000000000a158 syscall_exit+0x0/0x98
--- Exception: c01 (System Call) at 00003ffdbf7c10bc
SP (3fffc777abc0) is in userspace
19:mon>
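For readers less familiar with xmon output: the backtrace above lists the innermost frame first. The following is a minimal Python sketch, not part of the original report, that extracts the symbol names from such lines and prints the call chain from caller to callee. The helper name `call_chain` and the regular expression are illustrative assumptions based only on the line format shown here.

```python
import re

# Matches xmon frame lines such as:
# [c000001f39a13620] c0000000007fb6a8 .skb_free_head+0x78/0xb0
FRAME_RE = re.compile(
    r"^\[([0-9a-f]+)\]\s+([0-9a-f]+)\s+\.?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def call_chain(backtrace_text):
    """Return function names ordered from outermost caller to innermost frame."""
    symbols = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            symbols.append(match.group(3))
    # xmon prints the innermost frame first, so reverse for caller -> callee order.
    return list(reversed(symbols))

# A few frames copied from the backtrace in this report.
SAMPLE = """\
[c000001f39a13620] c0000000007fb6a8 .skb_free_head+0x78/0xb0
[c000001f39a136a0] c0000000007fb904 .__kfree_skb+0x24/0x40
[c000001f39a13720] c00000000080396c .skb_free_datagram_locked+0xbc/0x140
[c000001f39a137b0] c0000000008992c0 .udp_recvmsg+0x1b0/0x530
"""

if __name__ == "__main__":
    print(" -> ".join(call_chain(SAMPLE)))
```

Read this way, the trace shows the BUG being hit while a UDP receive path frees a socket buffer: `udp_recvmsg` calls `skb_free_datagram_locked`, which ends in `skb_free_head` and the failing `kfree` reported in the `pc` line.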
^^^^^^
DEBUG:
^^^^^^
Hi Xiao Lu,
A user-space job should not crash the node. If the process uses too
much memory, the process itself should get ENOMEM and the node
should not crash. Could you please check the system error log to see
what actually caused the node to crash?
From the information Dave provided, it looks like we accidentally
triggered a kernel bug. I believe we need someone from LTC to check it.
[466884.681785] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
pc: c00000000022fa34: .kfree+0x124/0x220
[466884.683019] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
sp: c000001f39a13590
msr: 9000000000029033
current = 0xc000001f58828bf0
paca = 0xc00000000fe45780 softe: 0 irq_happened: 0x01
pid = 98622, comm = getmem.64
[466884.685546] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
[466884.687415] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
pc: c00000000022fa34: .kfree+0x124/0x220
lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
sp: c000001f39a6b590
msr: 9000000000029033
current = 0xc000001f5886f840
paca = 0xc00000000fe43b80 softe: 0 irq_happened: 0x01
pid = 98621, comm = getmem.64
kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
enter ? for help
19:mon>
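The console output quoted above contains two register dumps, one per CPU, for two different getmem.64 processes. As a rough aid for triage, here is a small Python sketch that pairs each `cpu ...` header with the `pid`/`comm` line that follows it. The function name `crash_records` and both regular expressions are hypothetical and assume only the line shapes visible in this report.

```python
import re

# Header of an xmon register dump, e.g.:
# cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
CPU_RE = re.compile(r"^cpu (0x[0-9a-f]+): Vector: ([0-9a-f]+) \(([^)]+)\)")
# Process line of the same dump, e.g.: pid = 98622, comm = getmem.64
PID_RE = re.compile(r"pid\s*=\s*(\d+), comm = (\S+)")

def crash_records(console_text):
    """Collect one dict per 'cpu ...' block with the pid and command that follow it."""
    records = []
    current = None
    for line in console_text.splitlines():
        line = line.strip()
        cpu = CPU_RE.match(line)
        if cpu:
            current = {"cpu": cpu.group(1), "vector": cpu.group(2), "reason": cpu.group(3)}
            records.append(current)
            continue
        pid = PID_RE.search(line)
        if pid and current is not None:
            current["pid"] = int(pid.group(1))
            current["comm"] = pid.group(2)
    return records

# Lines copied from the two dumps in this report.
SAMPLE = """\
cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
pc: c00000000022fa34: .kfree+0x124/0x220
pid = 98622, comm = getmem.64
cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
pc: c00000000022fa34: .kfree+0x124/0x220
pid = 98621, comm = getmem.64
"""

if __name__ == "__main__":
    for record in crash_records(SAMPLE):
        print(record)
```

Both records share the same vector (700, Program Check) and the same `pc` in `.kfree`, which is consistent with two tasks of the same job hitting the same BUG at nearly the same time.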
** Affects: makedumpfile (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1339199
Title:
mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE
Status in “makedumpfile” package in Ubuntu:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/makedumpfile/+bug/1339199/+subscriptions