group.of.nepali.translators team mailing list archive

[Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

 

[X][PATCH 0/4] LP#1821259 Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099427.html

[B][PATCH 0/2] Fix for LP#1821259 (pending patches for) Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099432.html

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** No longer affects: linux (Ubuntu)

** Changed in: linux (Ubuntu Bionic)
       Status: New => Confirmed

** Changed in: linux (Ubuntu Xenial)
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1821259

Title:
  Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

Status in linux source package in Xenial:
  Confirmed
Status in linux source package in Bionic:
  Confirmed

Bug description:
  [Impact]

   * This problem hard-locks 2 CPUs in a deadlock, which in turn
     soft-locks other CPUs; the system becomes unusable.

   * This is relatively rare and difficult to hit because it is a
     corner case in scheduling/load balancing that must coincide
     with CPU stopper code, and it requires an SMP plus _NUMA_ system
     (but it can be hit with the synthetic test case attached to
     this LP bug).

   * Since SMP plus NUMA usually means _servers_, it is worth
     preventing this bug and the hard lockups / reboots it causes.

   * The fix resolves the potential deadlock by moving one of the
     calls involved in the deadlock out from under the locked code.
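
The deadlock and the effect of the fix can be sketched as a wait-for graph. This is a simplified model based on one reading of the two call traces in the original description, not the kernel code: CPU 6 (ceph-osd) is in stop_two_cpus() calling wake_up_process(), and CPU 10 (migration/10) is inside __schedule() spinning in cpu_stop_queue_work() on a spinlock. The resource names, the assumption that the task being woken is migration/10, and the helper find_wait_cycle() are illustrative assumptions.

```python
def find_wait_cycle(holds, waits):
    """Return the CPUs that form a wait-for cycle, or None if there is none.

    holds: dict mapping cpu -> set of resources that CPU currently owns
    waits: dict mapping cpu -> the single resource that CPU is waiting on
    """
    # Map every resource to the CPU that owns it.
    owner = {res: cpu for cpu, owned in holds.items() for res in owned}
    for start in holds:
        path, cpu = [], start
        # Follow "waits on a resource owned by" edges until they end or repeat.
        while cpu is not None and cpu not in path:
            path.append(cpu)
            cpu = owner.get(waits.get(cpu))
        if cpu is not None:
            return path[path.index(cpu):]
    return None


# Assumed state before the fix: CPU 6 issues the wakeup while it still owns
# the stopper lock, and CPU 10 wants that lock before it finishes scheduling.
before = find_wait_cycle(
    holds={"cpu6": {"stopper lock"}, "cpu10": {"migration/10 on-CPU"}},
    waits={"cpu6": "migration/10 on-CPU", "cpu10": "stopper lock"},
)

# Assumed state after the fix: the wakeup is issued only after the stopper
# lock has been released, so CPU 6 owns nothing while it waits.
after = find_wait_cycle(
    holds={"cpu6": set(), "cpu10": {"migration/10 on-CPU"}},
    waits={"cpu6": "migration/10 on-CPU", "cpu10": "stopper lock"},
)

print(before, after)
```

In this model the cycle disappears as soon as the waiting call is no longer made while the lock is held, which is the property the fix relies on.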

  [Test Case]

   * There's a synthetic test case to reproduce this problem
     (although without the stack traces - just a system hang)
     attached to this LP bug.

   * It uses kprobes/mdelay/cpu stopper calls to force the code
     to execute and force the timing/locking condition to occur.

   * $ sudo insmod kmod-stopper.ko

     Some dmesg logging occurs, and the system either hangs or does not.
     See the examples in the comments.
     
  [Regression Potential] 

   * These are patches to the CPU stop_machine.c code, and they
     slightly change how it works; however, there are no further
     upstream fixes for these patches, and they are still at the top
     of the 'git log --oneline -- kernel/stop_machine.c' output.

   * These patches have been verified with the synthetic test case
     and 'stress-ng --class scheduler --sequential 0' (no regressions)
     on a guest with 2 CPUs and on one physical system with 24 CPUs.

  [Other Info]
   
   * The patches are required on Xenial and later.
   * There are 4 patches for Xenial, and 2 patches pending for Bionic.
   * All patches are applied from Cosmic onwards.

  [Original Description]

  These 2 hard lockups appeared suddenly in the logs, and many
  soft lockups followed as fallout.

      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[<ffffffff810c8cb0>]  [<ffffffff810c8cb0>] native_queued_spin_lock_slowpath+0x160/0x170
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 0000:ffff883ff2113c58  EFLAGS: 00000002
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS:  0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092]  ffff883ff2113c68 ffffffff811870eb ffff883ff2113c80 ffffffff81819907
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094]  ffff881fff991ba0 ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096]  ffff881fff997c90 ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105]  [<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109]  [<ffffffff81819907>] _raw_spin_lock_irqsave+0x37/0x40
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113]  [<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116]  [<ffffffff8111ccd0>] stop_one_cpu_nowait+0x30/0x40
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119]  [<ffffffff810bbb5b>] load_balance+0x71b/0x940
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122]  [<ffffffff810bbff5>] pick_next_task_fair+0x275/0x4b0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126]  [<ffffffff81816166>] __schedule+0x6c6/0x7f0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132]  [<ffffffff810a2560>] ? sort_range+0x30/0x30
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134]  [<ffffffff818162c5>] schedule+0x35/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136]  [<ffffffff810a262d>] smpboot_thread_fn+0xcd/0x180
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139]  [<ffffffff8109f138>] kthread+0xd8/0xf0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141]  [<ffffffff8109f060>] ? kthread_park+0x60/0x60
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143]  [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144]  [<ffffffff8109f060>] ? kthread_park+0x60/0x60

      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: <...>
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 0010:[<ffffffff810aacb6>]  [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 0000:ffff883cda77fa78  EFLAGS: 00000002
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000017300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS:  00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365]  0000000000000202 ffff883cda77fa98 0000000000000003 0000000000000006
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368]  000000000000000a ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370]  ffff883cda77fac8 ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375]  [<ffffffff810ab035>] wake_up_process+0x15/0x20
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379]  [<ffffffff8111cc22>] stop_two_cpus+0x1b2/0x230
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382]  [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384]  [<ffffffff810b5d15>] ? dequeue_entity+0x455/0x8a0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386]  [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388]  [<ffffffff810aaa70>] ? __migrate_swap_task.part.83+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390]  [<ffffffff810ab18e>] migrate_swap+0xae/0x130
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392]  [<ffffffff810b4e44>] task_numa_migrate+0x504/0x930
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394]  [<ffffffff810b52e9>] numa_migrate_preferred+0x79/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396]  [<ffffffff810b9373>] task_numa_fault+0x923/0xcd0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400]  [<ffffffff8175e407>] ? tcp_recvmsg+0x6b7/0xbd0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404]  [<ffffffff811da9be>] ? mpol_misplaced+0x14e/0x190
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408]  [<ffffffff811b7836>] handle_pte_fault+0x5a6/0x1440
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411]  [<ffffffff816f6693>] ? sock_recvmsg+0x43/0x50
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413]  [<ffffffff811b9540>] handle_mm_fault+0x250/0x540
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417]  [<ffffffff81069e1a>] __do_page_fault+0x19a/0x430
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419]  [<ffffffff8106a0d2>] do_page_fault+0x22/0x30
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423]  [<ffffffff8181c5a8>] page_fault+0x28/0x30
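
When checking other systems for the same signature, the hard-lockup report lines can be pulled out of a log with a small script. The sketch below assumes the syslog line format shown in the traces above; the function name find_hard_lockups() and the regular expression are illustrative, and the sample lines are copied from this report.

```python
import re

# Matches the kernel timestamp and CPU number in lines such as:
#   ... kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"NMI watchdog: Watchdog detected hard LOCKUP on cpu (?P<cpu>\d+)"
)


def find_hard_lockups(lines):
    """Return a list of (kernel_timestamp_seconds, cpu) for each report line."""
    hits = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            hits.append((float(match.group("ts")), int(match.group("cpu"))))
    return hits


sample = [
    "Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10",
    "Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:",
    "Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6",
]

for timestamp, cpu in find_hard_lockups(sample):
    print(f"hard lockup on cpu {cpu} at kernel time {timestamp}")
```

Two reports on different CPUs within a few seconds of each other, one in a migration thread and one under stop_two_cpus(), would be consistent with the pattern described in this bug, although matching call traces are still needed to confirm it.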

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/xenial/+source/linux/+bug/1821259/+subscriptions