kernel-packages team mailing list archive

[Bug 1483343] Re: NMI watchdog: BUG: soft lockup errors when we execute lock_torture_wr tests

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: t2d <1483343@xxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Apr 2016 13:53:20 -0000
Reply-to: Bug 1483343 <1483343@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

It seems we have the same problem with the latest LTS kernel:

# uname -a
Linux dc01ram1rls 4.4.0-18-generic #34~14.04.1-Ubuntu SMP Thu Apr 7 18:31:54 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.4 LTS
Release:	14.04
Codename:	trusty

The kernel was installed with:
# apt-get install --install-recommends linux-generic-lts-xenial

Errors in /var/log/kern.log look like:

Apr 24 03:31:58 dc01ram1rls kernel: [280174.661115] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [rsync:104799]
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669494] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables 8021q garp mrp bridge stp llc dm_crypt intel_rapl ipmi_ssif x86_pkg_temp_thermal intel_powerclamp ipmi_devintf coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper sb_edac cryptd dcdbas edac_core input_leds mei_me lpc_ich mei ipmi_si 8250_fintek shpchp ipmi_msghandler acpi_power_meter mac_hid parport_pc ppdev lp parport igb dca ptp hid_generic usbhid uas hid usb_storage ahci pps_core megaraid_sas i2c_algo_bit libahci wmi fjes
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669522] CPU: 4 PID: 104799 Comm: rsync Not tainted 4.4.0-18-generic #34~14.04.1-Ubuntu
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669523] Hardware name: Dell Inc. PowerEdge T630/0W9WXC, BIOS 1.3.6 06/08/2015
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669525] task: ffff88041ebea940 ti: ffff880402580000 task.ti: ffff880402580000
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669526] RIP: 0010:[<ffffffff810c4e50>]  [<ffffffff810c4e50>] native_queued_spin_lock_slowpath+0x160/0x170
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669531] RSP: 0018:ffff8804025839f8  EFLAGS: 00000202
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669532] RAX: 0000000000000101 RBX: 0000000000000000 RCX: 0000000000000001
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669533] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffff81fce320
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669533] RBP: ffff8804025839f8 R08: 0000000000000101 R09: ffff880f9da4b7a4
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669534] R10: ffff880855937000 R11: 000000000000008c R12: 0000000000000000
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669535] R13: ffff880f9e2799f0 R14: ffff880f9e2799c0 R15: 0000000000000000
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669536] FS:  00007f0980b0b740(0000) GS:ffff88085ec80000(0000) knlGS:0000000000000000
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669537] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669538] CR2: 0000000002c0dd08 CR3: 0000000405421000 CR4: 00000000001426e0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669539] Stack:
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669539]  ffff880402583a08 ffffffff81180407 ffff880402583a18 ffffffff817ee550
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669541]  ffff880402583a80 ffffffff8125ad4a 0000000000000000 ffff880402583a68
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669542]  ffffffff812332ab ffff880f9c044140 ffff880f9e2761a0 ffff880402583a70
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669544] Call Trace:
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669547]  [<ffffffff81180407>] queued_spin_lock_slowpath+0xb/0xf
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669550]  [<ffffffff817ee550>] _raw_spin_lock+0x20/0x30
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669552]  [<ffffffff8125ad4a>] mb_cache_entry_get+0x1aa/0x220
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669555]  [<ffffffff812332ab>] ? __getblk_gfp+0x2b/0x60
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669557]  [<ffffffff812c8efc>] ext4_xattr_block_set+0x7c/0x9d0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669559]  [<ffffffff812c84e4>] ? ext4_xattr_set_entry+0x34/0x340
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669560]  [<ffffffff812ca3e2>] ext4_xattr_set_handle+0x2f2/0x420
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669562]  [<ffffffff812cf6d0>] __ext4_set_acl+0x280/0x320
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669564]  [<ffffffff812cfb42>] ext4_set_acl+0xd2/0x110
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669565]  [<ffffffff8125c0ad>] ? posix_acl_from_xattr+0x11d/0x170
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669567]  [<ffffffff8125c1b7>] posix_acl_xattr_set+0xb7/0x150
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669569]  [<ffffffff81221321>] generic_setxattr+0x61/0x80
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669570]  [<ffffffff81221e01>] __vfs_setxattr_noperm+0x61/0x1a0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669573]  [<ffffffff81326c9d>] ? security_inode_setxattr+0xbd/0xd0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669575]  [<ffffffff81221fe7>] vfs_setxattr+0xa7/0xb0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669576]  [<ffffffff81222124>] setxattr+0x134/0x1c0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669579]  [<ffffffff811db97f>] ? kmem_cache_alloc+0x19f/0x200
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669582]  [<ffffffff8120c1af>] ? getname_flags+0x4f/0x1f0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669583]  [<ffffffff810c46ff>] ? percpu_down_read+0x1f/0x50
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669584]  [<ffffffff8122223c>] path_setxattr+0x8c/0xc0
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669586]  [<ffffffff812222f4>] SyS_setxattr+0x14/0x20
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669587]  [<ffffffff817ee8f6>] entry_SYSCALL_64_fastpath+0x16/0x75
Apr 24 03:31:58 dc01ram1rls kernel: [280174.669588] Code: 8b 01 48 85 c0 75 0a f3 90 48 8b 01 48 85 c0 74 f6 c7 40 08 01 00 00 00 e9 61 ff ff ff 83 fa 01 75 07 e9 c2 fe ff ff f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 66 90 0f 1f 44 00 00
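
When a log contains many of these reports, it can help to pull out just the CPU, duration, and task from each header line. The following is a minimal sketch and not part of the original report: the function name parse_soft_lockups and the regular expression are assumptions based only on the message format shown above, and the sample line is the first line of the report with its terminal wrapping removed.

```python
import re

# Pattern derived from the "soft lockup" header line quoted above
# (an assumption; other kernel versions may word the message differently).
SOFT_LOCKUP = re.compile(
    r"NMI watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft-lockup header found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "secs": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


# Sample input: the header line of the report plus one unrelated line.
sample = [
    "Apr 24 03:31:58 dc01ram1rls kernel: [280174.661115] NMI watchdog: "
    "BUG: soft lockup - CPU#4 stuck for 22s! [rsync:104799]",
    "Apr 24 03:31:58 dc01ram1rls kernel: [280174.669539] Stack:",
]
events = parse_soft_lockups(sample)
print(events)
```

To scan a real log, an open file object for /var/log/kern.log can be passed in place of the sample list, since the function only iterates over lines.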

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1483343

Title:
  NMI watchdog: BUG: soft lockup errors when we execute lock_torture_wr
  tests

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Vivid:
  In Progress

Bug description:
  ---Problem Description---
  NMI watchdog: BUG: soft lockup errors when we execute lock_torture_wr tests
    
  ---uname output---
  Linux alp15 3.19.0-18-generic #18~14.04.1-Ubuntu SMP Wed May 20 09:40:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = P8 
    
  ---Steps to Reproduce---
  Install a P8 Power VM LPAR with Ubuntu 14.04.2 ISO.
  Then install the Ubuntu 14.04.3 kernel on the same and reboot.
  Then compile and build the LTP latest test suites on the same.

  root@alp15:~# tar -xvf ltp-full-20150420.tar.bz2
  root@alp15:~# cd ltp-full-20150420/
  root@alp15:~/ltp-full-20150420# ls
  aclocal.m4      configure     execltp.in  install-sh  Makefile          README                runltplite.sh    testcases    utils
  autom4te.cache  configure.ac  IDcheck.sh  lib         Makefile.release  README.kernel_config  runtest          testscripts  ver_linux
  config.guess    COPYING       include     ltpmenu     missing           runalltests.sh        scenario_groups  TODO         VERSION
  config.sub      doc           INSTALL     m4          pan               runltp                scripts          tools
  root@alp15:~/ltp-full-20150420# ./configure
  root@alp15:~/ltp-full-20150420# make
  root@alp15:~/ltp-full-20150420# make install

  root@alp15:/opt/ltp/testcases/bin# ./lock_torture.sh
  lock_torture 1 TINFO : estimate time 6.00 min
  lock_torture 1 TINFO : spin_lock: running 60 sec...

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034386] NMI watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [lock_torture_wr:2337]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034389] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lock_torture_wr:2331]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034394] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [lock_torture_wr:2339]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034396] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lock_torture_wr:2346]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034398] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s! [lock_torture_wr:2334]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034410] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [lock_torture_wr:2321]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034412] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [lock_torture_wr:2333]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.038386] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [lock_torture_wr:2327]

   
  Stack trace output:
   root@alp15:~# dmesg | more
  [ 1717.146881] lock_torture_wr R  running task
  [ 1717.146881]
  [ 1717.146885]     0  2555      2 0x00000804
  [ 1717.146887] Call Trace:
  [ 1717.146894] [c000000c7551b820] [c000000c7551b860] 0xc000000c7551b860 (unreliable)
  [ 1717.146899] [c000000c7551b860] [c0000000000b4fb0] __do_softirq+0x220/0x3b0
  [ 1717.146904] [c000000c7551b960] [c0000000000b5478] irq_exit+0x98/0x100
  [ 1717.146909] [c000000c7551b980] [c00000000001fa54] timer_interrupt+0xa4/0xe0
  [ 1717.146913] [c000000c7551b9b0] [c000000000002758] decrementer_common+0x158/0x180
  [ 1717.146922] --- interrupt: 901 at _raw_write_lock+0x68/0xc0
  [ 1717.146922]     LR = torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.146927] [c000000c7551bca0] [c000000c7551bcd0] 0xc000000c7551bcd0 (unreliable)
  [ 1717.146934] [c000000c7551bcd0] [d00000000d4810b8] torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.146939] [c000000c7551bcf0] [d00000000d480578] lock_torture_writer+0x98/0x210 [locktorture]
  [ 1717.146944] [c000000c7551bd80] [c0000000000da4d4] kthread+0x114/0x140
  [ 1717.146948] [c000000c7551be30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
  [ 1717.146951] Task dump for CPU 10:
  [ 1717.146953] lock_torture_wr R  running task        0  2537      2 0x00000804
  [ 1717.146957] Call Trace:
  [ 1717.146961] [c000000c7557b820] [c000000c7557b860] 0xc000000c7557b860 (unreliable)
  [ 1717.146966] [c000000c7557b860] [c0000000000b4fb0] __do_softirq+0x220/0x3b0
  [ 1717.146970] [c000000c7557b960] [c0000000000b5478] irq_exit+0x98/0x100
  [ 1717.146975] [c000000c7557b980] [c00000000001fa54] timer_interrupt+0xa4/0xe0
  [ 1717.146979] [c000000c7557b9b0] [c000000000002758] decrementer_common+0x158/0x180
  [ 1717.146988] --- interrupt: 901 at _raw_write_lock+0x68/0xc0
  [ 1717.146988]     LR = torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.146993] [c000000c7557bca0] [c000000c7557bcd0] 0xc000000c7557bcd0 (unreliable)
  [ 1717.147000] [c000000c7557bcd0] [d00000000d4810b8] torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.147006] [c000000c7557bcf0] [d00000000d480578] lock_torture_writer+0x98/0x210 [locktorture]
  [ 1717.147013] [c000000c7557bd80] [c0000000000da4d4] kthread+0x114/0x140
  [ 1717.147017] [c000000c7557be30] [c00000000000956c] ret_from_kernel_thread+0x5c/0x70
  [ 1717.147020] Task dump for CPU 17:
  [ 1717.147021] Task dump for CPU 2:
  [ 1717.147028] lock_torture_wr R
  [ 1717.147028] lock_torture_wr R  running task
  [ 1717.147033]   running task        0  2547      2 0x00000804
  [ 1717.147042]     0  2533      2 0x00000804
  [ 1717.147044] Call Trace:
  [ 1717.147045] Call Trace:
  [ 1717.147053] [c000000c732a3820] [c000000c7f688448] 0xc000000c7f688448
  [ 1717.147056] [c000000c7555f820] [c000000c7fa48448] 0xc000000c7fa48448
  [ 1717.147059]  (unreliable)
  [ 1717.147063]  (unreliable)
  [ 1717.147063]
  [ 1717.147067]
  [ 1717.147072] Task dump for CPU 18:
  [ 1717.147073] Task dump for CPU 7:
  [ 1717.147077] lock_torture_wr R  running task
  [ 1717.147082] lock_torture_wr R    0  2555      2 0x00000804
  [ 1717.147088]   running task
  [ 1717.147088] Call Trace:
  [ 1717.147096] [c000000c7551b820] [c000000c7551b860] 0xc000000c7551b860
  [ 1717.147096]     0  2559      2 0x00000804
  [ 1717.147102] Call Trace:
  [ 1717.147105]  (unreliable)

  It is possible that we are missing this commit, which fixes a deadlock
  during these tests:

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=f548d99ef4f5ec8f7080e88ad07c44d16d058ddc

  We will check the Ubuntu source shortly to see if this is the case, and
  then we can suggest building a kernel to see if it helps.

  Running apt-get source linux-image- on the test system didn't pull
  down the sources, but the kernel being used is close to the one used
  for vivid (3.19.0-25.26), so I pulled down the git source tree for it
  with git clone git://kernel.ubuntu.com/ubuntu/ubuntu-vivid.git. The
  resulting source shows that the patch for the commit mentioned is not
  applied.
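
One way to repeat that kind of check is to search the commit messages of the cloned tree for the upstream commit ID. This is only a sketch under an assumption: it relies on backported commits citing the upstream ID somewhere in their message, so a miss is a hint rather than proof, and inspecting the source itself remains the reliable test. The helper names build_grep_command and commit_mentioned are hypothetical.

```python
import subprocess

# Upstream commit ID taken from the kernel.org link above.
UPSTREAM_SHA = "f548d99ef4f5ec8f7080e88ad07c44d16d058ddc"


def build_grep_command(repo_path, sha):
    """Build a `git log` command listing commits whose message mentions `sha`."""
    return ["git", "-C", repo_path, "log", "--oneline", "--grep=" + sha]


def commit_mentioned(repo_path, sha=UPSTREAM_SHA):
    """Return True if any commit message on the current branch mentions `sha`.

    Assumes `git` is installed and `repo_path` is a git working tree.
    """
    result = subprocess.run(
        build_grep_command(repo_path, sha),
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())
```

With the tree cloned into its default directory, the call would be commit_mentioned("ubuntu-vivid").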

  As I basically understand it, the problem that was fixed is this:
  torture_rwlock_read_lock_irq() acquires a read lock on the lock
  called:

  torture_rwlock

  but anything that calls the counterpart
  torture_rwlock_read_unlock_irq() to relinquish the read lock instead
  ends up doing a write_unlock_irqrestore() on torture_rwlock, in
  essence leaving the read lock held. So when the locktorture module
  calls something like torture_rwlock_write_lock(), as we see in the
  bug description, it will block indefinitely because there is still at
  least one lock reader.
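
The effect of the mismatched unlock can be illustrated with a toy model. This is a deliberately simplified, single-threaded sketch of reader-writer lock bookkeeping, not the kernel's implementation: real rwlocks encode their state differently on each architecture, and the class and function names here are made up for illustration.

```python
class ToyRWLock:
    """Single-threaded model of a reader-writer lock's state (not a real lock)."""

    def __init__(self):
        self.readers = 0      # number of current read holders
        self.writer = False   # True while a writer holds the lock

    def read_lock(self):
        assert not self.writer
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1

    def write_unlock(self):
        # Clears only the writer flag; the reader count is untouched.
        self.writer = False

    def write_trylock(self):
        # A writer can get in only when nobody holds the lock.
        if self.readers == 0 and not self.writer:
            self.writer = True
            return True
        return False


def buggy_read_unlock(lock):
    # Mirrors the mismatch described above: the read-side unlock
    # helper releases the write side instead.
    lock.write_unlock()


def fixed_read_unlock(lock):
    lock.read_unlock()


def writer_can_proceed(unlock):
    """Take a read lock, release it with `unlock`, then try to write-lock."""
    lock = ToyRWLock()
    lock.read_lock()
    unlock(lock)
    return lock.write_trylock()


print("fixed:", writer_can_proceed(fixed_read_unlock))
print("buggy:", writer_can_proceed(buggy_read_unlock))
```

In this model the writer gets the lock after the matching unlock but never after the mismatched one, because the reader count stays at one. A real writer would spin at that point, which is consistent with the lock_torture_wr tasks sitting in _raw_write_lock in the traces.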

  I'll go ahead and mirror this since I am pretty confident this is the
  issue (it should also affect Vivid).

  We'll have to figure out how to get the sources for the LTS kernel to
  build a test kernel as well.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1483343/+subscriptions