kernel-packages team mailing list archive
Message #127974
[Bug 1461620] Re: NUMA task migration race condition due to stop task not being checked when balancing happens
** Description changed:
+ SRU Justification:
+
+ Impact:
+ - Deadlock when migrating processes between NUMA domains.
+ - Observed in one kernel dump that was provided to me.
+ - Hard to trigger.
+
+ Fix:
+ - Developed upstream, following upstream discussion.
+ - Discussion: https://lkml.org/lkml/2015/6/15/531
+
+ Testcase:
+ - Stress test in a virtual NUMA environment
+ - Wait indefinitely... Hard to trigger
+ - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461620/comments/8
+ - Can at least confirm that the fix did not introduce a regression
+
+ ----
+
The following kernel panic was brought to my attention:
"""
- [3367068.076488] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
- [3367068.092735] BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:153]
- [3367068.100368] Modules linked in: iptable_raw xt_nat xt_REDIRECT veth openvswitch(OF) gre vxlan ip_tunnel libcrc32c dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ipmi_devintf 8021q garp stp mrp llc bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel gpio_ich joydev aes_x86_64 lrw gf128mul glue_helper ablk_helper ipmi_si cryptd sb_edac wmi lpc_ich edac_core mac_hid acpi_power_meter nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache lp parport hid_generic ixgbe fnic libfcoe dca ptp
- [3367068.100409] libfc megaraid_sas pps_core mdio usbhid scsi_transport_fc hid enic scsi_tgt
- [3367068.100415] CPU: 16 PID: 153 Comm: migration/16 Tainted: GF O 3.13.0-34-generic #60-Ubuntu
- [3367068.100417] Hardware name: Cisco Systems Inc UCSC-C220-M3S/UCSC-C220-M3S, BIOS C220M3.1.5.4f.0.111320130449 11/13/2013
- [3367068.100419] task: ffff881fd2f517f0 ti: ffff881fd2f1c000 task.ti: ffff881fd2f1c000
- [3367068.100420] RIP: 0010:[<ffffffff810f5944>] [<ffffffff810f5944>] multi_cpu_stop+0x64/0xf0
- [3367068.100426] RSP: 0000:ffff881fd2f1dd98 EFLAGS: 00000246
- [3367068.100427] RAX: ffffffff8180af40 RBX: 0000000000000086 RCX: 000000000000a402
- [3367068.100428] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff883e607edb48
- [3367068.100430] RBP: ffff881fd2f1ddb8 R08: 0000000000000282 R09: 0000000000000001
- [3367068.100431] R10: 000000000000b6d8 R11: ffff881fc374dc80 R12: 0000000000014440
- [3367068.100432] R13: ffff881fd291ae00 R14: ffff881fd291ae08 R15: 0000000200000010
- [3367068.100433] FS: 0000000000000000(0000) GS:ffff881fffd00000(0000) knlGS:0000000000000000
- [3367068.100434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
- [3367068.100435] CR2: 00007f6202134b98 CR3: 0000000001c0e000 CR4: 00000000001407e0
- [3367068.100437] Stack:
- [3367068.100438] ffff883e607edb70 ffff881fffd0ede0 ffff881fffd0ede8 ffff883e607edb48
- [3367068.100441] ffff881fd2f1de78 ffffffff810f5b5e ffffffff8109dfc4 ffff881fffd14440
- [3367068.100443] ffff881fd2f1de08 ffffffff81097508 0000000000000000 ffff881fffd14440
- [3367068.100446] Call Trace:
- [3367068.100450] [<ffffffff810f5b5e>] cpu_stopper_thread+0x7e/0x150
- [3367068.100454] [<ffffffff8109dfc4>] ? vtime_common_task_switch+0x24/0x40
- [3367068.100458] [<ffffffff81097508>] ? finish_task_switch+0x128/0x170
- [3367068.100462] [<ffffffff8171fd41>] ? __schedule+0x381/0x7d0
- [3367068.100465] [<ffffffff810926af>] smpboot_thread_fn+0xff/0x1b0
- [3367068.100467] [<ffffffff810925b0>] ? SyS_setgroups+0x1a0/0x1a0
- [3367068.100470] [<ffffffff8108b3d2>] kthread+0xd2/0xf0
- [3367068.100473] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
- [3367068.100477] [<ffffffff8172c6bc>] ret_from_fork+0x7c/0xb0
- [3367068.100479] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
+ [3367068.076488] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
+ [3367068.092735] BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:153]
+ [3367068.100368] Modules linked in: iptable_raw xt_nat xt_REDIRECT veth openvswitch(OF) gre vxlan ip_tunnel libcrc32c dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ipmi_devintf 8021q garp stp mrp llc bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel gpio_ich joydev aes_x86_64 lrw gf128mul glue_helper ablk_helper ipmi_si cryptd sb_edac wmi lpc_ich edac_core mac_hid acpi_power_meter nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache lp parport hid_generic ixgbe fnic libfcoe dca ptp
+ [3367068.100409] libfc megaraid_sas pps_core mdio usbhid scsi_transport_fc hid enic scsi_tgt
+ [3367068.100415] CPU: 16 PID: 153 Comm: migration/16 Tainted: GF O 3.13.0-34-generic #60-Ubuntu
+ [3367068.100417] Hardware name: Cisco Systems Inc UCSC-C220-M3S/UCSC-C220-M3S, BIOS C220M3.1.5.4f.0.111320130449 11/13/2013
+ [3367068.100419] task: ffff881fd2f517f0 ti: ffff881fd2f1c000 task.ti: ffff881fd2f1c000
+ [3367068.100420] RIP: 0010:[<ffffffff810f5944>] [<ffffffff810f5944>] multi_cpu_stop+0x64/0xf0
+ [3367068.100426] RSP: 0000:ffff881fd2f1dd98 EFLAGS: 00000246
+ [3367068.100427] RAX: ffffffff8180af40 RBX: 0000000000000086 RCX: 000000000000a402
+ [3367068.100428] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff883e607edb48
+ [3367068.100430] RBP: ffff881fd2f1ddb8 R08: 0000000000000282 R09: 0000000000000001
+ [3367068.100431] R10: 000000000000b6d8 R11: ffff881fc374dc80 R12: 0000000000014440
+ [3367068.100432] R13: ffff881fd291ae00 R14: ffff881fd291ae08 R15: 0000000200000010
+ [3367068.100433] FS: 0000000000000000(0000) GS:ffff881fffd00000(0000) knlGS:0000000000000000
+ [3367068.100434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+ [3367068.100435] CR2: 00007f6202134b98 CR3: 0000000001c0e000 CR4: 00000000001407e0
+ [3367068.100437] Stack:
+ [3367068.100438] ffff883e607edb70 ffff881fffd0ede0 ffff881fffd0ede8 ffff883e607edb48
+ [3367068.100441] ffff881fd2f1de78 ffffffff810f5b5e ffffffff8109dfc4 ffff881fffd14440
+ [3367068.100443] ffff881fd2f1de08 ffffffff81097508 0000000000000000 ffff881fffd14440
+ [3367068.100446] Call Trace:
+ [3367068.100450] [<ffffffff810f5b5e>] cpu_stopper_thread+0x7e/0x150
+ [3367068.100454] [<ffffffff8109dfc4>] ? vtime_common_task_switch+0x24/0x40
+ [3367068.100458] [<ffffffff81097508>] ? finish_task_switch+0x128/0x170
+ [3367068.100462] [<ffffffff8171fd41>] ? __schedule+0x381/0x7d0
+ [3367068.100465] [<ffffffff810926af>] smpboot_thread_fn+0xff/0x1b0
+ [3367068.100467] [<ffffffff810925b0>] ? SyS_setgroups+0x1a0/0x1a0
+ [3367068.100470] [<ffffffff8108b3d2>] kthread+0xd2/0xf0
+ [3367068.100473] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
+ [3367068.100477] [<ffffffff8172c6bc>] ret_from_fork+0x7c/0xb0
+ [3367068.100479] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[3367068.100480] Code: db 85 db 41 0f 95 c5 31 f6 31 d2 eb 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da <83> fa 04 74 3d f3 90 41 8b 5c 24 20 39 d3 74 f0 83 fb 02 75 d7
"""
In the first comments I explain WHY this is happening and HOW to fix
it.
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1461620
Title:
NUMA task migration race condition due to stop task not being checked
when balancing happens
Status in linux package in Ubuntu:
Invalid
Status in linux source package in Trusty:
In Progress
Status in linux source package in Vivid:
In Progress
Bug description:
SRU Justification:

Impact:
- Deadlock when migrating processes between NUMA domains.
- Observed in one kernel dump that was provided to me.
- Hard to trigger.

Fix:
- Developed upstream, following upstream discussion.
- Discussion: https://lkml.org/lkml/2015/6/15/531

Testcase:
- Stress test in a virtual NUMA environment
- Wait indefinitely... Hard to trigger
- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461620/comments/8
- Can at least confirm that the fix did not introduce a regression

----

The following kernel panic was brought to my attention:
"""
[3367068.076488] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
[3367068.092735] BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:153]
[3367068.100368] Modules linked in: iptable_raw xt_nat xt_REDIRECT veth openvswitch(OF) gre vxlan ip_tunnel libcrc32c dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ipmi_devintf 8021q garp stp mrp llc bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel gpio_ich joydev aes_x86_64 lrw gf128mul glue_helper ablk_helper ipmi_si cryptd sb_edac wmi lpc_ich edac_core mac_hid acpi_power_meter nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache lp parport hid_generic ixgbe fnic libfcoe dca ptp
[3367068.100409] libfc megaraid_sas pps_core mdio usbhid scsi_transport_fc hid enic scsi_tgt
[3367068.100415] CPU: 16 PID: 153 Comm: migration/16 Tainted: GF O 3.13.0-34-generic #60-Ubuntu
[3367068.100417] Hardware name: Cisco Systems Inc UCSC-C220-M3S/UCSC-C220-M3S, BIOS C220M3.1.5.4f.0.111320130449 11/13/2013
[3367068.100419] task: ffff881fd2f517f0 ti: ffff881fd2f1c000 task.ti: ffff881fd2f1c000
[3367068.100420] RIP: 0010:[<ffffffff810f5944>] [<ffffffff810f5944>] multi_cpu_stop+0x64/0xf0
[3367068.100426] RSP: 0000:ffff881fd2f1dd98 EFLAGS: 00000246
[3367068.100427] RAX: ffffffff8180af40 RBX: 0000000000000086 RCX: 000000000000a402
[3367068.100428] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff883e607edb48
[3367068.100430] RBP: ffff881fd2f1ddb8 R08: 0000000000000282 R09: 0000000000000001
[3367068.100431] R10: 000000000000b6d8 R11: ffff881fc374dc80 R12: 0000000000014440
[3367068.100432] R13: ffff881fd291ae00 R14: ffff881fd291ae08 R15: 0000000200000010
[3367068.100433] FS: 0000000000000000(0000) GS:ffff881fffd00000(0000) knlGS:0000000000000000
[3367068.100434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3367068.100435] CR2: 00007f6202134b98 CR3: 0000000001c0e000 CR4: 00000000001407e0
[3367068.100437] Stack:
[3367068.100438] ffff883e607edb70 ffff881fffd0ede0 ffff881fffd0ede8 ffff883e607edb48
[3367068.100441] ffff881fd2f1de78 ffffffff810f5b5e ffffffff8109dfc4 ffff881fffd14440
[3367068.100443] ffff881fd2f1de08 ffffffff81097508 0000000000000000 ffff881fffd14440
[3367068.100446] Call Trace:
[3367068.100450] [<ffffffff810f5b5e>] cpu_stopper_thread+0x7e/0x150
[3367068.100454] [<ffffffff8109dfc4>] ? vtime_common_task_switch+0x24/0x40
[3367068.100458] [<ffffffff81097508>] ? finish_task_switch+0x128/0x170
[3367068.100462] [<ffffffff8171fd41>] ? __schedule+0x381/0x7d0
[3367068.100465] [<ffffffff810926af>] smpboot_thread_fn+0xff/0x1b0
[3367068.100467] [<ffffffff810925b0>] ? SyS_setgroups+0x1a0/0x1a0
[3367068.100470] [<ffffffff8108b3d2>] kthread+0xd2/0xf0
[3367068.100473] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[3367068.100477] [<ffffffff8172c6bc>] ret_from_fork+0x7c/0xb0
[3367068.100479] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[3367068.100480] Code: db 85 db 41 0f 95 c5 31 f6 31 d2 eb 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da <83> fa 04 74 3d f3 90 41 8b 5c 24 20 39 d3 74 f0 83 fb 02 75 d7
"""
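For triage, the key fields of the "BUG: soft lockup" line in the log above (CPU number, stall duration, task name and PID) can be pulled out mechanically. The following is a minimal sketch, not part of the original report: the names `SoftLockup` and `parse_soft_lockup` are hypothetical, and the regular expression assumes the exact message layout shown in this log.

```python
import re
from typing import NamedTuple, Optional


class SoftLockup(NamedTuple):
    """Fields of one 'BUG: soft lockup' kernel log line (hypothetical helper type)."""
    timestamp: float      # seconds since boot, from the [ ... ] prefix
    cpu: int              # CPU that was reported stuck
    stuck_seconds: int    # how long the watchdog saw no progress
    comm: str             # name of the task running on that CPU
    pid: int              # PID of that task


# Assumes the message layout seen in this report:
#   [<ts>] BUG: soft lockup - CPU#<n> stuck for <s>s! [<comm>:<pid>]
_SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the parsed fields, or None if the line is not a soft lockup report."""
    m = _SOFT_LOCKUP.search(line)
    if m is None:
        return None
    return SoftLockup(
        timestamp=float(m["ts"]),
        cpu=int(m["cpu"]),
        stuck_seconds=int(m["secs"]),
        comm=m["comm"],
        pid=int(m["pid"]),
    )
```

Applied to the line in this log, it yields CPU 16, a 22 second stall, and task "migration/16" with PID 153, which is the per-CPU migration thread that the call trace shows inside multi_cpu_stop.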
In the first comments I explain WHY this is happening and HOW to fix
it.
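As an illustration of the "stress test" idea in the Testcase section, here is a minimal sketch that generates task migration churn by repeatedly re-pinning worker processes to random CPUs. This is an assumption on my part, not the testcase from the linked comment: the function names `bounce` and `stress` are hypothetical, it is Linux-only, and it is not known to trigger this lockup. At best it can serve as a regression smoke test on a kernel carrying the fix.

```python
import os
import random
import time
from multiprocessing import Process


def bounce(cpus, duration_s, seed=0):
    """Re-pin the calling task to a random CPU from `cpus` until the deadline.

    Returns the number of affinity changes requested. The original affinity
    mask is restored before returning.
    """
    rng = random.Random(seed)
    original = os.sched_getaffinity(0)
    deadline = time.monotonic() + duration_s
    moves = 0
    try:
        while time.monotonic() < deadline:
            # Pinning to a single CPU forces a migration whenever the task
            # is currently running somewhere else.
            os.sched_setaffinity(0, {rng.choice(cpus)})
            moves += 1
    finally:
        os.sched_setaffinity(0, original)
    return moves


def stress(workers=4, duration_s=60.0):
    """Run `workers` bouncing processes in parallel; return their exit codes."""
    cpus = sorted(os.sched_getaffinity(0))
    procs = [
        Process(target=bounce, args=(cpus, duration_s, i))
        for i in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

Calling stress(workers=8, duration_s=600.0) inside a guest configured with several virtual NUMA nodes, while watching dmesg for soft lockup messages, would be one way to use it. Only CPUs already allowed for the process are used, so no extra privileges should be needed.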