group.of.nepali.translators team mailing list archive
Message #29161
[Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers
[X][PATCH 0/4] LP#1821259 Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099427.html
[B][PATCH 0/2] Fix for LP#1821259 (pending patches for) Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099432.html
** Also affects: linux (Ubuntu Bionic)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Xenial)
Importance: Undecided
Status: New
** No longer affects: linux (Ubuntu)
** Changed in: linux (Ubuntu Bionic)
Status: New => Confirmed
** Changed in: linux (Ubuntu Xenial)
Status: New => Confirmed
--
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1821259
Title:
Hard lockup in 2 CPUs due to deadlock in cpu_stoppers
Status in linux source package in Xenial:
Confirmed
Status in linux source package in Bionic:
Confirmed
Bug description:
[Impact]
* This problem hard locks up 2 CPUs in a deadlock, which in turn
soft locks up other CPUs; the system becomes unusable.
* This is relatively rare / difficult to hit because it is a
corner case in scheduling/load balancing that depends on timing
with the CPU stopper code, and it requires an SMP plus _NUMA_ system
(but it can be hit with the synthetic test case attached to the LP bug).
* Since SMP plus NUMA usually means _servers_, it is worth
preventing this bug and the hard lockups / reboots it causes.
* The fix resolves the potential deadlock by moving one of the
calls involved in the deadlock out of the locked section of code.
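The general idea in the last point can be illustrated in user space. The following is only a minimal sketch of the pattern "queue under the lock, make the risky call after releasing it"; it is not the kernel patch. The names Stopper, wake, queue_and_wake_locked and queue_then_wake are hypothetical, and treating the call in question as a wakeup is an assumption based on wake_up_process appearing in the trace in the original description.

```python
import threading

class Stopper:
    """Hypothetical stand-in for a per-CPU stopper: a lock guarding a work list."""
    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.works = []

def wake(stopper, log):
    # Stand-in for the call that may itself need other locks.
    # It only records whether the stopper lock was held when it ran.
    log.append((stopper.cpu, stopper.lock.locked()))

def queue_and_wake_locked(stopper, work, log):
    # Deadlock-prone shape: the wakeup runs inside the locked section.
    with stopper.lock:
        stopper.works.append(work)
        wake(stopper, log)

def queue_then_wake(stopper, work, log):
    # Safer shape: only queue under the lock, wake after releasing it.
    with stopper.lock:
        stopper.works.append(work)
    wake(stopper, log)

log = []
queue_and_wake_locked(Stopper(6), "swap", log)
queue_then_wake(Stopper(10), "swap", log)
print(log)  # prints [(6, True), (10, False)]
```

The second shape never holds the stopper lock while the wakeup runs, so the wakeup cannot take part in a lock cycle that includes that lock.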
[Test Case]
* There's a synthetic test case to reproduce this problem
(although without the stack traces - just a system hang)
attached to this LP bug.
* It uses kprobes/mdelay/cpu stopper calls to force the code
to execute and force the timing/locking condition to occur.
* $ sudo insmod kmod-stopper.ko
Some dmesg logging occurs, and the system either hangs or not.
See examples in comments.
[Regression Potential]
* These are patches to the cpu stop_machine.c code, and they
change how it works somewhat; however, there are no further upstream
fixes for these patches, and they are still at the top
of the 'git log --oneline -- kernel/stop_machine.c' output.
* These patches have been verified with the synthetic test case
and 'stress-ng --class scheduler --sequential 0' (no regressions)
on a guest with 2 CPUs and on one physical system with 24 CPUs.
[Other Info]
* The patches are required on Xenial and later.
* There are 4 patches for Xenial, and 2 patches pending for Bionic.
* All patches are applied from Cosmic onwards.
[Original Description]
These 2 hard lockups appeared suddenly in the logs, and many
soft lockups occurred after them as fallout.
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[<ffffffff810c8cb0>] [<ffffffff810c8cb0>] native_queued_spin_lock_slowpath+0x160/0x170
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 0000:ffff883ff2113c58 EFLAGS: 00000002
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS: 0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092] ffff883ff2113c68 ffffffff811870eb ffff883ff2113c80 ffffffff81819907
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094] ffff881fff991ba0 ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096] ffff881fff997c90 ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105] [<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109] [<ffffffff81819907>] _raw_spin_lock_irqsave+0x37/0x40
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113] [<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116] [<ffffffff8111ccd0>] stop_one_cpu_nowait+0x30/0x40
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119] [<ffffffff810bbb5b>] load_balance+0x71b/0x940
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122] [<ffffffff810bbff5>] pick_next_task_fair+0x275/0x4b0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126] [<ffffffff81816166>] __schedule+0x6c6/0x7f0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132] [<ffffffff810a2560>] ? sort_range+0x30/0x30
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134] [<ffffffff818162c5>] schedule+0x35/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136] [<ffffffff810a262d>] smpboot_thread_fn+0xcd/0x180
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139] [<ffffffff8109f138>] kthread+0xd8/0xf0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143] [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: <...>
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 0010:[<ffffffff810aacb6>] [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 0000:ffff883cda77fa78 EFLAGS: 00000002
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000017300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS: 00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365] 0000000000000202 ffff883cda77fa98 0000000000000003 0000000000000006
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368] 000000000000000a ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370] ffff883cda77fac8 ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375] [<ffffffff810ab035>] wake_up_process+0x15/0x20
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379] [<ffffffff8111cc22>] stop_two_cpus+0x1b2/0x230
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382] [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384] [<ffffffff810b5d15>] ? dequeue_entity+0x455/0x8a0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386] [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388] [<ffffffff810aaa70>] ? __migrate_swap_task.part.83+0x80/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390] [<ffffffff810ab18e>] migrate_swap+0xae/0x130
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392] [<ffffffff810b4e44>] task_numa_migrate+0x504/0x930
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394] [<ffffffff810b52e9>] numa_migrate_preferred+0x79/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396] [<ffffffff810b9373>] task_numa_fault+0x923/0xcd0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400] [<ffffffff8175e407>] ? tcp_recvmsg+0x6b7/0xbd0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404] [<ffffffff811da9be>] ? mpol_misplaced+0x14e/0x190
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408] [<ffffffff811b7836>] handle_pte_fault+0x5a6/0x1440
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411] [<ffffffff816f6693>] ? sock_recvmsg+0x43/0x50
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413] [<ffffffff811b9540>] handle_mm_fault+0x250/0x540
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417] [<ffffffff81069e1a>] __do_page_fault+0x19a/0x430
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419] [<ffffffff8106a0d2>] do_page_fault+0x22/0x30
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423] [<ffffffff8181c5a8>] page_fault+0x28/0x30
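When triaging logs like the above, it can help to reduce each hard lockup to its CPU number and the reliable frames of its call trace. The following is a rough sketch, assuming the log line format shown in this report; the function name summarize_lockups and the regular expressions are illustrative and not part of any existing tool. Frames marked with "?" are skipped because the kernel flags them as unreliable.

```python
import re

# Patterns assume the 4.4-era log format shown in this report.
LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def summarize_lockups(lines):
    """Return one dict per hard lockup: the CPU and its reliable trace frames."""
    events = []
    current = None
    in_trace = False
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            current = {"cpu": int(m.group(1)), "frames": []}
            events.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            f = FRAME_RE.search(line)
            # Skip frames prefixed with "?" (marked unreliable).
            if f and not f.group(1):
                current["frames"].append(f.group(2))
    return events

sample = [
    "kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10",
    "kernel: [4603802.484071] RIP: 0010:[<ffffffff810c8cb0>] [<ffffffff810c8cb0>] native_queued_spin_lock_slowpath+0x160/0x170",
    "kernel: [4603802.484098] Call Trace:",
    "kernel: [4603802.484105] [<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf",
    "kernel: [4603802.484113] [<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80",
    "kernel: [4603802.484132] [<ffffffff810a2560>] ? sort_range+0x30/0x30",
]
events = summarize_lockups(sample)
print(events)
```

Run over the full log, this would yield one entry for cpu 10 (spinning in cpu_stop_queue_work) and one for cpu 6 (in stop_two_cpus via migrate_swap), which makes the two sides of the deadlock easier to compare.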
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/xenial/+source/linux/+bug/1821259/+subscriptions