kernel-packages team mailing list archive
-
kernel-packages team
-
Mailing list archive
-
Message #120030
[Bug 1458045] Re: KVM and CFS bandwidth control causes kernel crashes (oops)
I suspect the following has occured:
https://github.com/torvalds/linux/blob/v3.16/kernel/sched/fair.c#L4829
if (!cfs_rq->nr_running)
goto idle;
put_prev_task(rq, prev);
do {
se = pick_next_entity(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
We can see that `nr_running` is actually set to 0 when checking the
dumps
crash> cfs_rq.nr_running ffff883ffedf3140
nr_running = 0
When following the call order of `put_prev_task`, it actually calls the
class-specific `put_prev_task_fair`. This involves calling
`put_prev_entity`. As part of that function, it does the following:
/* throttle cfs_rqs exceeding runtime */
check_cfs_rq_runtime(cfs_rq);
If we start tracing on what `check_cfs_rq_runtime` does, we can see that
it calls `throttle_cfs_rq` which actually manipulates the `nr_running`.
If `nr_running` was not 0 but was set to 0 during `put_prev_task`, the
system crashes as it attempts to pick an entity (but cannot find any).
This is my first time debugging Kernel issues so this is just what I
think so far, but comments are welcome..
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1458045
Title:
KVM and CFS bandwidth control causes kernel crashes (oops)
Status in linux package in Ubuntu:
Confirmed
Bug description:
We've seen this crash at least 3 times when we start setting CPU
limits using `cgroups`. It makes using CPU limits impossible, causing
instabilities in the operating system. Finally, after installing
linux-crashdump, we got a full copy of the crash message.
========================================================
[146055.357476] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[146055.359620] IP: [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.361890] PGD 0
[146055.364131] Oops: 0000 [#1] SMP
[146055.366475] Modules linked in: vhost_net vhost macvtap macvlan act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm crct10dif_pclmul crc32_pclmul
[146055.388889] ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core megaraid_sas libahci mdio [last unloaded: ipmi_si]
[146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted 3.16.0-37-generic #51~14.04.1-Ubuntu
[146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.0.4 08/28/2014
[146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti: ffff883a1c168000
[146055.421817] RIP: 0010:[<ffffffff810a7d31>] [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.428079] RSP: 0018:ffff883a1c16bce8 EFLAGS: 00010092
[146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000044aa200
[146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff883ffedf3140
[146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09: 0000000000000001
[146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12: ffff883ffedf3140
[146055.460968] R13: 000000000000001f R14: 0000000000000001 R15: ffff883ffedf30c0
[146055.467722] FS: 00007f404919d700(0000) GS:ffff883ffede0000(0000) knlGS:ffff880002380000
[146055.474756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4: 00000000001427e0
[146055.489134] Stack:
[146055.496412] 0000000000000000 ffff883ffedf3140 000000000000001f ffff883a1c16bd68
[146055.504053] ffffffff810af2f8 ffff883ffedf3140 00000000000130c0 ffff883fcab69e90
[146055.511786] ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895 ffff883fcab6a3c8
[146055.519551] Call Trace:
[146055.527330] [<ffffffff810af2f8>] pick_next_task_fair+0x78/0x880
[146055.535292] [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
[146055.543379] [<ffffffff810a4895>] ? sched_clock_cpu+0x85/0xc0
[146055.551519] [<ffffffff81768afb>] __schedule+0x11b/0x7a0
[146055.559722] [<ffffffff81769579>] _cond_resched+0x29/0x40
[146055.568020] [<ffffffffc0338289>] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
[146055.576509] [<ffffffffc0321ce2>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
[146055.585045] [<ffffffff81156952>] ? perf_event_context_sched_in+0xa2/0xc0
[146055.593771] [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
[146055.602531] [<ffffffff8109dec8>] ? finish_task_switch+0x108/0x180
[146055.611413] [<ffffffffc032bcd4>] ? kvm_on_user_return+0x74/0x80 [kvm]
[146055.620339] [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
[146055.629396] [<ffffffff8176d20d>] system_call_fastpath+0x1a/0x1f
[146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48 8b 80
[146055.657833] RIP [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.667524] RSP <ffff883a1c16bce8>
[146055.677082] CR2: 0000000000000038
========================================================
I've found the following "potential" fix that doesn't seem to have
every made it through: https://lkml.org/lkml/2015/4/7/611
In addition, I have a 12GB dump file generated by linux-crashdump,
please let me know if there's anything I can do with it which can help
troubleshoot this issue.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1458045/+subscriptions
References