kernel-packages team mailing list archive

[Bug 1557379] Re: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [cicc:15164] noticed during compilation of CUDA Toolkit

 

** Package changed: ubuntu => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1557379

Title:
  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [cicc:15164]
  noticed during compilation of CUDA Toolkit

Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> - 2016-01-18 05:50:42 ==
  While compiling the NVIDIA_CUDA-7.5 Toolkit Samples in an Ubuntu 14.04.4 guest, the BUG below is noticed.

  [    1.077464] NVRM: loading NVIDIA UNIX ppc64le Kernel Module  352.39  Fri Aug 14 17:10:41 PDT 2015
  [    1.211127] init: failsafe main process (498) killed by TERM signal
  [    1.307135] audit: type=1400 audit(1453110711.213:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/tcpdump" pid=697 comm="apparmor_parser"
  [    1.307458] audit: type=1400 audit(1453110711.213:9): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/sbin/dhclient" pid=695 comm="apparmor_parser"
  [    1.307467] audit: type=1400 audit(1453110711.213:10): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=695 comm="apparmor_parser"
  [    1.560244] init: plymouth-upstart-bridge main process ended, respawning
  [ 1600.099759] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [cicc:15164]
  [ 1600.100531] Modules linked in: nvidia(POE) drm pseries_rng rtc_generic
  [ 1600.100611] CPU: 0 PID: 15164 Comm: cicc Tainted: P           OE   4.2.0-23-generic #28~14.04.1-Ubuntu
  [ 1600.100613] task: c00000000390d3e0 ti: c0000000039c8000 task.ti: c0000000039c8000
  [ 1600.100615] NIP: 00001000003eadc8 LR: 00001000003eb5e0 CTR: 00001000003eb5a0
  [ 1600.100622] REGS: c0000000039cbea0 TRAP: 0901   Tainted: P           OE    (4.2.0-23-generic)
  [ 1600.100623] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 22224848  XER: 00000000
  [ 1600.100646] CFAR: 00001000003eb5dc SOFTE: 1 
  [ 1600.100646] GPR00: 00001000003eb5e0 00003fffc0d01300 0000100000ed2838 000001000459e250 
  [ 1600.100646] GPR04: 0000000000000061 0000000000000000 0000000000000000 00000100098e3ae0 
  [ 1600.100646] GPR08: 000000000000090c 00000100045d8c00 000001000459e308 00000000000007ff 
  [ 1600.100646] GPR12: 0000000022224824 000010000006d750 
  [ 1600.100657] NIP [00001000003eadc8] 0x1000003eadc8
  [ 1600.100658] LR [00001000003eb5e0] 0x1000003eb5e0
  [ 1600.100659] Call Trace:
  [ 1672.099724] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 32s! [cicc:15182]
  [ 1672.100995] Modules linked in: nvidia(POE) drm pseries_rng rtc_generic
  [ 1672.101033] CPU: 4 PID: 15182 Comm: cicc Tainted: P           OEL  4.2.0-23-generic #28~14.04.1-Ubuntu
  [ 1672.101035] task: c00000009c5328a0 ti: c000000003060000 task.ti: c000000003060000
  [ 1672.101037] NIP: 0000100001266b0c LR: 0000100001266af4 CTR: 0000100001266a70
  [ 1672.101038] REGS: c000000003063ea0 TRAP: 0901   Tainted: P           OEL   (4.2.0-23-generic)
  [ 1672.101039] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 00000000
  [ 1672.101046] CFAR: 00001000012637b4 SOFTE: 1 
  [ 1672.101046] GPR00: 0000100001266af4 00003ffff7275860 0000100001392800 000001001e07e370 
  [ 1672.101046] GPR04: 0000000000000100 0000100001389fb0 00000000ffffff0c 000001001e07e360 
  [ 1672.101046] GPR08: 0000000000000051 0000000000000000 0000000000000001 000001001b64b230 
  [ 1672.101046] GPR12: 0000000000002200 000010000006d750 
  [ 1672.101055] NIP [0000100001266b0c] 0x100001266b0c
  [ 1672.101056] LR [0000100001266af4] 0x100001266af4
  [ 1672.101057] Call Trace:

  The nouveau modules are disabled, and nvidia drivers are loaded in the guest.
  # lsmod | grep nvidia
  nvidia              11484644  0 
  drm                   447733  2 nvidia

  After downloading the CUDA Developer tool-kit, during the compilation
  the BUG is noticed.

  This issue occurs intermittently; in some runs it isn't noticed at
  all.
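
  Since the lockup is intermittent, it can help to pull the watchdog
  lines out of a saved dmesg and compare them across runs. The following
  is a minimal Python sketch, not part of the original report; the name
  parse_soft_lockups and the regular expression are assumptions based
  only on the message format quoted above.

```python
import re

# Pattern assumed from the watchdog lines quoted in this report, e.g.
# "[ 1600.099759] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [cicc:15164]"
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]:]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(dmesg_text):
    """Return one dict per soft-lockup line found in dmesg_text."""
    events = []
    for m in LOCKUP_RE.finditer(dmesg_text):
        events.append({
            "timestamp": float(m.group("ts")),       # seconds since boot
            "cpu": int(m.group("cpu")),              # CPU reported as stuck
            "stuck_seconds": int(m.group("secs")),   # duration in the message
            "comm": m.group("comm"),                 # process name, e.g. cicc
            "pid": int(m.group("pid")),
        })
    return events
```

  Feeding it the text of a saved dmesg file gives one record per
  watchdog report, which makes it easy to see which process (cicc,
  cudafe++) and which CPU were involved in each run.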

  == Guest Details ==
  # uname -a
  Linux ubuntu 4.2.0-23-generic #28~14.04.1-Ubuntu SMP Thu Dec 31 13:41:19 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

  # lspci -nn
  00:01.0 Ethernet controller [0200]: Red Hat, Inc Virtio network device [1af4:1000]
  00:02.0 USB controller [0c03]: Apple Inc. KeyLargo/Intrepid USB [106b:003f]
  00:03.0 SCSI storage controller [0100]: Red Hat, Inc Virtio block device [1af4:1001]
  00:04.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon [1af4:1002]
  00:05.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)

  # cat /proc/driver/nvidia/version

  NVRM version: NVIDIA UNIX ppc64le Kernel Module  352.39  Fri Aug 14 17:10:41 PDT 2015
  GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) 

  # nvcc -V
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2015 NVIDIA Corporation
  Built on Tue_Aug_11_14:31:50_CDT_2015
  Cuda compilation tools, release 7.5, V7.5.17

  == Host Details ==
  # uname -a
  Linux fr84p01.aus.stglabs.ibm.com 3.18.24-366.el7_1.pkvm3_1_0.4900.1.ppc64le #1 SMP Tue Jan 12 12:10:24 CST 2016 ppc64le ppc64le ppc64le GNU/Linux

  # cat /etc/issue
  IBM_PowerKVM release 3.1.0 service 1 build 49.0 (pkvm3_1_0)

  == Comment: #1 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> -
  2016-01-18 05:51:55 ==

  
  == Comment: #2 - MICHAEL BRINGMANN <mbringm@xxxxxxxxxx> - 2016-01-18 12:46:18 ==
  Is this possibly related to:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461620

  which points to patch:  stop_machine: Fix deadlock between multiple
  stop_two_cpus()

  == Comment: #3 - Scott E. Garfinkle <seg@xxxxxxxxxx> - 2016-01-22 11:06:41 ==
  So the working assumption here is that there's a problem in the non-OSS Nvidia driver running in the Ubuntu guest. Leonardo will work with Brian to figure out how to move this forward (open a bug with Nvidia or whatever).

  == Comment: #4 - Alistair Popple <apopple@xxxxxxxxxxx> - 2016-01-26 17:53:51 ==
  [ 1672.101056] LR [0000100001266af4] 0x100001266af4
  [ 1672.101057] Call Trace:

  Is there a call trace? Also, have you tried compiling the toolkit
  without the nvidia module loaded (i.e. rmmod nvidia)? I don't think it
  should be needed just for compilation.
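
  Before each such test run it is worth confirming that the module
  really is unloaded. A small Python sketch of that check follows; the
  helper name module_loaded is hypothetical, and it only assumes the
  lsmod-style layout shown earlier in this report (module name in the
  first column).

```python
def module_loaded(lsmod_text, name):
    """Return True if `name` is listed as a module in lsmod-style output.

    Only the first column is compared, so a module that merely appears in
    another module's "used by" column does not count as loaded.
    """
    for line in lsmod_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False
```

  The same function should work on the contents of /proc/modules, since
  each line there also begins with the module name.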

  == Comment: #6 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> - 2016-01-29 05:13:02 ==
  Hi Alistair,

  A Call Trace was noticed during the compilation. 
  Later as suggested, when I compiled the tool-kit without the nvidia modules loaded, the soft-lockup wasn't noticed.

  Also, with the nvidia modules loaded, the soft-lockup is noticed
  intermittently.

  Thanks,
  Santwana

  == Comment: #7 - Alistair Popple <apopple@xxxxxxxxxxx> - 2016-02-01 18:57:53 ==
  Please attach a dmesg with the call trace.

  == Comment: #8 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> - 2016-02-03 05:33:55 ==
  Hi Alistair,

  I have attached the complete guest dmesg; however, during the compilation the following is noticed:
  [ 2108.124722] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 31s! [cicc:15165]
  [ 2108.127839] Modules linked in: isofs nvidia(POE) drm pseries_rng rtc_generic
  [ 2108.127878] CPU: 7 PID: 15165 Comm: cicc Tainted: P           OE   4.2.0-23-generic #28~14.04.1-Ubuntu
  [ 2108.127880] task: c00000009bcc3f90 ti: c00000009bf7c000 task.ti: c00000009bf7c000
  [ 2108.127882] NIP: 00001000004b4810 LR: 00001000004b4810 CTR: 00001000004b45a0
  [ 2108.127883] REGS: c00000009bf7fea0 TRAP: 0901   Tainted: P           OE    (4.2.0-23-generic)
  [ 2108.127884] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 22224828  XER: 20000000
  [ 2108.127891] CFAR: 00001000004b45ac SOFTE: 1 
  [ 2108.127891] GPR00: 0000100000191910 00003fffcf7d4870 0000100000ed2838 0000000000000020 
  [ 2108.127891] GPR04: 000001004153f7c0 0000000000000000 0000000000000000 0000000000000020 
  [ 2108.127891] GPR08: 0000000000000020 0000000000000068 00001000004b45a0 0000010043079660 
  [ 2108.127891] GPR12: 0000000022224828 000010000006d750 
  [ 2108.127901] NIP [00001000004b4810] 0x1000004b4810
  [ 2108.127902] LR [00001000004b4810] 0x1000004b4810
  [ 2108.127903] Call Trace:

  There is no more data in the dmesg output.
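
  One possible reading of the empty Call Trace, offered here as an
  assumption rather than a conclusion from the thread: in every dump
  above the MSR has the PR (problem state) flag set and the NIP is a
  low user-space address, which would mean the watchdog sampled the CPU
  while it was running user code, so there would be no kernel stack to
  print. The Python sketch below decodes the MSR value; the bit
  positions are taken from the Linux powerpc MSR definitions as best
  recalled, and only the flags the kernel itself printed in these logs
  are listed.

```python
# (name, bit position) pairs; assumed from the Linux powerpc MSR layout.
# Only flags that appear in the dumps in this report are included, in the
# same order the kernel printed them.
MSR_BITS = [
    ("SF", 63), ("VEC", 25), ("VSX", 23), ("EE", 15), ("PR", 14),
    ("FP", 13), ("ME", 12), ("IR", 5), ("DR", 4), ("RI", 1), ("LE", 0),
]


def decode_msr(msr):
    """Return the names of the known flags that are set in msr."""
    return [name for name, bit in MSR_BITS if (msr >> bit) & 1]


def sampled_in_user_mode(msr):
    """True if the PR (problem state, i.e. user mode) bit is set."""
    return bool((msr >> 14) & 1)
```

  For example, decode_msr(0x800000010280f033) yields the same flag list
  that appears between the angle brackets on the MSR lines.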

  == Comment: #9 - Alistair Popple <apopple@xxxxxxxxxxx> - 2016-02-04 21:24:40 ==
  Santwana,

  Given it only occurs intermittently how many times have you tried to
  recreate it without the Nvidia module loaded? Can you please test it a
  number of times and see if you can recreate it without the Nvidia
  module loaded. I'm not sure this bug is Nvidia related; rather, it may
  just be the workload causing us to hit some other bug (as suggested by
  Michael). Thanks.
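
  Repeating the build a fixed number of times and counting new watchdog
  messages after each run could be scripted roughly as below. This is a
  Python sketch under stated assumptions: run_trials, read_dmesg and
  make_build are hypothetical helper names, dmesg is assumed to be on
  PATH and readable, and the actual build command is left to the caller.

```python
import subprocess

MARKER = "BUG: soft lockup"


def count_lockups(log_text):
    """Count log lines that contain the soft-lockup marker."""
    return sum(1 for line in log_text.splitlines() if MARKER in line)


def run_trials(build, read_log, trials):
    """Call build() `trials` times; return new lockup lines seen per trial."""
    new_per_trial = []
    seen = count_lockups(read_log())
    for _ in range(trials):
        build()
        now = count_lockups(read_log())
        # The kernel ring buffer can wrap, so never report a negative delta.
        new_per_trial.append(max(0, now - seen))
        seen = now
    return new_per_trial


def read_dmesg():
    # Assumes `dmesg` is available and the caller may read the kernel log.
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


def make_build(cmd, cwd=None):
    # Hypothetical wrapper around whatever build command is being tested.
    return lambda: subprocess.run(cmd, cwd=cwd).returncode
```

  A call such as run_trials(make_build(["make"], cwd=samples_dir),
  read_dmesg, 10), with samples_dir pointing at the samples tree, would
  then give a per-run count that can be compared with and without the
  Nvidia module loaded.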

  - Alistair

  == Comment: #10 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> - 2016-02-08 05:56:48 ==
  Hello Alistair,

  I made a few more attempts to recreate this issue without loading the Nvidia modules. 
  After 2-3 trials of compiling CUDA with the Nvidia modules unloaded, I am hitting the issue again.
  There is no Call Trace present in the dmesg of the guest. 
  Below is the log:
  [  121.586175] init: plymouth-stop pre-start process (1027) terminated with status 1
  [  337.428977] [drm] Module unloaded
  [ 1724.117110] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 49s! [cudafe++:14176]
  [ 1724.118811] Modules linked in: pseries_rng drm rtc_generic [last unloaded: nvidia]
  [ 1724.118837] CPU: 1 PID: 14176 Comm: cudafe++ Tainted: P           OE   4.2.0-27-generic #32~14.04.1-Ubuntu
  [ 1724.118840] task: c000000003f99290 ti: c000000095ff0000 task.ti: c000000095ff0000
  [ 1724.118842] NIP: 00000000101a6ae0 LR: 00000000101a6a90 CTR: 0000000000000000
  [ 1724.118843] REGS: c000000095ff3ea0 TRAP: 0901   Tainted: P           OE    (4.2.0-27-generic)
  [ 1724.118844] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 442428b2  XER: 20000000
  [ 1724.118851] CFAR: 00001000002395f0 SOFTE: 1 
  [ 1724.118851] GPR00: fffffeffd07e9de0 00003ffffbb8c2c0 00000000103e95d0 000001002f816220 
  [ 1724.118851] GPR04: 00003ffffbb8c410 00000000000000b0 0000000000000010 0000000000000004 
  [ 1724.118851] GPR08: 0000000000000004 0000000000000000 0000000000000001 000001002f8162d0 
  [ 1724.118851] GPR12: 0000000000000001 000010000006b400 
  [ 1724.118861] NIP [00000000101a6ae0] 0x101a6ae0
  [ 1724.118862] LR [00000000101a6a90] 0x101a6a90
  [ 1724.118863] Call Trace:

  
  == Comment: #13 - Sam Bobroff <sbobroff@xxxxxxxxxxx> - 2016-02-14 23:02:42 ==
  Hi Santwana,

  I've been unable to replicate this myself (I'm able to compile the 7.5
  toolkit samples without issue even with a K80 function passed through
  to the guest and the nvidia driver loaded), could you try to replicate
  the issue without the pass-through being set up at all? (And of course
  the nvidia driver not loaded in the guest as it won't have the
  hardware.)

  What exact version of Ubuntu are you using in the guest? What's the
  name of the install image you used and has it been updated (e.g. apt-
  get update/upgrade)? I couldn't find a 14.04.4 version, 14.04.3 seems
  to be the latest available (I'm using ubuntu-14.04.3-server-
  ppc64el.iso.)

  Also, when the soft lockup message appears, is the machine actually
  failing in any way? Does the guest lock up or is it just the message
  that is the problem?

  == Comment: #14 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> - 2016-02-15 07:11:39 ==
  Hi Sam,

  I was able to replicate this issue even without the K80 controller being passed through to the guest. 
  During the CUDA compilation, the soft lockup is still reproducible; the details are below:

  [  122.996219] init: plymouth-stop pre-start process (1064) terminated with status 1
  [ 1480.023269] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 58s! [cicc:12677]
  [ 1480.024415] Modules linked in: pseries_rng rtc_generic
  [ 1480.024434] CPU: 1 PID: 12677 Comm: cicc Not tainted 4.2.0-27-generic #32~14.04.1-Ubuntu
  [ 1480.024437] task: c000000096b60000 ti: c000000096a40000 task.ti: c000000096a40000
  [ 1480.024438] NIP: 0000100001263c30 LR: 0000100001266af4 CTR: 0000100001266a70
  [ 1480.024440] REGS: c000000096a43ea0 TRAP: 0901   Not tainted  (4.2.0-27-generic)
  [ 1480.024441] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 24222248  XER: 00000000
  [ 1480.024447] CFAR: 0000100001263c44 SOFTE: 1 
  [ 1480.024447] GPR00: 0000100001266af4 00003fffe6f714a0 0000100001392800 0000100001389760 
  [ 1480.024447] GPR04: 0000000000000000 0000000000000000 0000000000000061 0000000000000000 
  [ 1480.024447] GPR08: 0000000000000c51 000001004852ad30 0000000000000d01 0000010048131478 
  [ 1480.024447] GPR12: 0000000000002200 000010000006d750 
  [ 1480.024456] NIP [0000100001263c30] 0x100001263c30
  [ 1480.024458] LR [0000100001266af4] 0x100001266af4
  [ 1480.024459] Call Trace:
  [ 1516.023209] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 66s! [cicc:12686]
  [ 1516.023515] Modules linked in: pseries_rng rtc_generic
  [ 1516.023522] CPU: 3 PID: 12686 Comm: cicc Tainted: G             L  4.2.0-27-generic #32~14.04.1-Ubuntu
  [ 1516.023525] task: c000000003439290 ti: c00000009c978000 task.ti: c00000009c978000
  [ 1516.023526] NIP: 00001000012803f8 LR: 00001000012296a4 CTR: 00001000012803c0
  [ 1516.023528] REGS: c00000009c97bea0 TRAP: 0901   Tainted: G             L   (4.2.0-27-generic)
  [ 1516.023529] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28222284  XER: 20000000
  [ 1516.023535] CFAR: 00001000011f170c SOFTE: 1 
  [ 1516.023535] GPR00: 0000000000000000 00003fffc1933b60 0000100001392800 0000100000e290d0 
  [ 1516.023535] GPR04: 2525252525252525 0000000000000000 0000000000000000 00001000000667a0 
  [ 1516.023535] GPR08: 0000100000e290d0 00000000000000ff ffffffffff000000 0000010029fd102d 
  [ 1516.023535] GPR12: 0000000000646c25 000010000006d750 
  [ 1516.023545] NIP [00001000012803f8] 0x1000012803f8
  [ 1516.023546] LR [00001000012296a4] 0x1000012296a4
  [ 1516.023547] Call Trace:

  I am using an Ubuntu 14.04.4 guest; the details are below:
  # uname -a
  Linux ubuntu-new 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:31:44 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

  # cat /etc/os-release 
  NAME="Ubuntu"
  VERSION="14.04.3 LTS, Trusty Tahr"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 14.04.3 LTS"
  VERSION_ID="14.04"

  The guest was installed using ISO downloaded from
  http://cdimage.ubuntu.com/ubuntu-server/trusty/daily/current/trusty-
  server-ppc64el.iso

  The guest was also updated to the latest packages using apt-get update. 
  When the soft lockup appears, the guest isn't failing; we only notice the message in the kernel log after the compilation is over.
  The soft lockup is mostly noticed at the following stage of the compilation:
  mkdir -p ../../bin/ppc64le/linux/release
  cp fastWalshTransform ../../bin/ppc64le/linux/release
  make[1]: Leaving directory `/NVIDIA_CUDA-7.5_Samples/6_Advanced/fastWalshTransform'
  make[1]: Entering directory `/NVIDIA_CUDA-7.5_Samples/6_Advanced/segmentationTreeThrust'
  /usr/local/cuda-7.5/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -o segmentationTree.o -c segmentationTree.cu
  [ 1276.122187] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 41s! [cicc:12414]

  Thanks,
  Santwana

  == Comment: #15 - SUDIPTO GHOSH <sudiptoghosh@xxxxxxxxxx> - 2016-02-16 03:23:34 ==
  Based on the last 2 comments, it seems that the issue exists with Ubuntu 14.04.04 (as per Santwana's tests) & does not exist in 14.04.03 (per Sam's test).

  == Comment: #16 - Scott E. Garfinkle <seg@xxxxxxxxxx> - 2016-02-18 11:25:44 ==
  Moved over to BugsAgainstDistro since this does not seem to be a problem with the host kernel.

  == Comment: #17 - SUDIPTO GHOSH <sudiptoghosh@xxxxxxxxxx> - 2016-02-23 05:36:19 ==
  @Santwana - Could you try again with the supported CUDA stack (7.5.23) provided by Nvidia?

  == Comment: #18 - SANTWANA SAMANTRAY <santwana.samantray@xxxxxxxxxx> - 2016-02-24 05:33:47 ==
  This issue is still reproducible with the CUDA Stack - 7.5.23 and Nvidia Driver - 352.68. 
  The Bug is noticed during compilation of the CUDA Toolkit.

  # cat /proc/driver/nvidia/version 
  NVRM version: NVIDIA UNIX ppc64le Kernel Module  352.68  Tue Dec  1 16:32:13 PST 2015
  GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1) 

  # dpkg -l cuda
  Desired=Unknown/Install/Remove/Purge/Hold
  | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
  |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
  ||/ Name           Version      Architecture Description
  +++-==============-============-============-=================================
  ii  cuda           7.5-23       ppc64el      CUDA meta-package

  [    1.497034] init: plymouth-upstart-bridge main process ended, respawning
  [ 2340.096583] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 42s! [cicc:20527]
  [ 2340.097850] Modules linked in: nvidia(POE) pseries_rng drm rtc_generic
  [ 2340.097930] CPU: 1 PID: 20527 Comm: cicc Tainted: P           OE   4.2.0-27-generic #32~14.04.1-Ubuntu
  [ 2340.097933] task: c000000003e00000 ti: c000000003efc000 task.ti: c000000003efc000
  [ 2340.097934] NIP: 00001000012676e0 LR: 0000100000d4a8d0 CTR: 0000100001266f40
  [ 2340.097942] REGS: c000000003effea0 TRAP: 0901   Tainted: P           OE    (4.2.0-27-generic)
  [ 2340.097943] MSR: 800000010280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 24224828  XER: 00000000
  [ 2340.097965] CFAR: 000010000126719c SOFTE: 1 
  [ 2340.097965] GPR00: 0000100000d4a8d0 00003ffffe8e67c0 0000100001392800 0000010023a433e0 
  [ 2340.097965] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000001 
  [ 2340.097965] GPR08: 00001000013897b8 0000000000000000 0000000000000001 0000010023a433f0 
  [ 2340.097965] GPR12: 0000100001266f40 000010000006d750 
  [ 2340.097975] NIP [00001000012676e0] 0x1000012676e0
  [ 2340.097976] LR [0000100000d4a8d0] 0x100000d4a8d0
  [ 2340.097977] Call Trace:
