kernel-packages team mailing list archive
Message #103020
[Bug 1413540] Re: soft lockup issues with nested KVM VMs running tempest
** Description changed:
-
[Impact]
Users of nested KVM for testing openstack have soft lockups as follows:
- [74180.076007] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:14590]
- <snip>
- [74180.076007] Call Trace:
- [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
- [74180.076007] [<ffffffff810dbf75>] smp_call_function_single+0xe5/0x190
- [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
- [74180.076007] [<ffffffffa00c4300>] ? rmap_write_protect+0x80/0x80 [kvm]
- [74180.076007] [<ffffffff810dc3a6>] smp_call_function_many+0x286/0x2d0
- [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
- [74180.076007] [<ffffffff8105c8f7>] native_flush_tlb_others+0x37/0x40
- [74180.076007] [<ffffffff8105c9cb>] flush_tlb_mm_range+0x5b/0x230
- [74180.076007] [<ffffffff8105b80d>] pmdp_splitting_flush+0x3d/0x50
- [74180.076007] [<ffffffff811ac95b>] __split_huge_page+0xdb/0x720
- [74180.076007] [<ffffffff811ad008>] split_huge_page_to_list+0x68/0xd0
- [74180.076007] [<ffffffff811ad9a6>] __split_huge_page_pmd+0x136/0x330
- [74180.076007] [<ffffffff8117728d>] unmap_page_range+0x7dd/0x810
- [74180.076007] [<ffffffffa00a66b5>] ? kvm_mmu_notifier_invalidate_range_start+0x75/0x90 [kvm]
- [74180.076007] [<ffffffff81177341>] unmap_single_vma+0x81/0xf0
- [74180.076007] [<ffffffff811784ed>] zap_page_range+0xed/0x150
- [74180.076007] [<ffffffff8108ed74>] ? hrtimer_start_range_ns+0x14/0x20
- [74180.076007] [<ffffffff81174fbf>] SyS_madvise+0x3bf/0x850
- [74180.076007] [<ffffffff810db841>] ? SyS_futex+0x71/0x150
- [74180.076007] [<ffffffff8173186d>] system_call_fastpath+0x1a/0x1f
+
+ PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"
+ #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
+ #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
+ #2 [ffff88043fd03e30] panic at ffffffff81719ff4
+ #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
+ #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
+ #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
+ #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
+ #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
+ #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
+ --- <IRQ stack> ---
+ #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
+ [exception RIP: generic_exec_single+130]
+ RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202
+ RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001
+ RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
+ RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68
+ R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000
+ R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000
+ ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
+ #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
+ #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
+ #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
+ #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
+ #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
+ #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
+ #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
+ #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
+ #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
+ #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
+ #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
+ #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
+ #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
+ RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293
+ RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff
+ RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000
+ RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738
+ R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700
+ R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000
+ ORIG_RAX: 000000000000001c CS: 0033 SS: 002b
+
[Test Case]
- Deploy openstack on openstack
- Run tempest on L1 cloud
- Check kernel log of L1 nova-compute nodes
+
+ (Although this may not necessarily be related to nested KVM)
+ Potentially related: https://lkml.org/lkml/2014/11/14/656
--
Original Description:
When installing qemu-kvm on a VM, KSM is enabled.
I have encountered this problem in trusty:
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
The way to see the behaviour:
1) $ more /sys/kernel/mm/ksm/run
0
2) $ sudo apt-get install qemu-kvm
3) $ more /sys/kernel/mm/ksm/run
1
To see the soft lockups, deploy a cloud on a virtualised environment such as ctsstack and run tempest on it at least twice. The compute nodes of the virtualised deployment will eventually stop responding and log:
[24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
I am not sure whether the problem is that we are enabling KSM on a VM or
that nested KSM is not behaving properly. Either way, I can easily
reproduce it; please contact me if you need further details.
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1413540
Title:
soft lockup issues with nested KVM VMs running tempest
Status in linux package in Ubuntu:
Confirmed
Bug description:
[Impact]
Users of nested KVM for testing openstack have soft lockups as follows:
PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
#1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
#2 [ffff88043fd03e30] panic at ffffffff81719ff4
#3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
#4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
#5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
#6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
#7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
#8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
#9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001
RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68
R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000
R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
#11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
#12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
#15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
#16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
#17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
#18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
#19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
#20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
#21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
#22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293
RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff
RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000
RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738
R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700
R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 000000000000001c CS: 0033 SS: 002b
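The backtrace above appears to be in the frame format printed by the crash utility's bt command ("#N [stack address] function at text address"). As a rough aid for comparing traces across reports, the following minimal sketch extracts the frames and lists the call chain from the syscall entry inwards. The names FRAME_RE, parse_frames and call_chain are hypothetical helpers introduced here for illustration; they are not part of crash or of any kernel tooling.

```python
import re

# Matches frame lines such as:
#   #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
# Register dump lines (RIP:, RAX:, ...) do not match and are skipped.
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+\[(?P<stack>[0-9a-f]+)\]\s+"
    r"(?P<func>\S+)\s+at\s+(?P<addr>[0-9a-f]+)"
)


def parse_frames(backtrace_text):
    """Return a list of (frame index, function name, text address) tuples."""
    frames = []
    for line in backtrace_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((int(m["index"]), m["func"], int(m["addr"], 16)))
    return frames


def call_chain(frames):
    """Function names ordered from the outermost frame to the innermost."""
    return [func for _, func, _ in sorted(frames, reverse=True)]


if __name__ == "__main__":
    import sys

    # Usage (assumed): python parse_bt.py < backtrace.txt
    print(" -> ".join(call_chain(parse_frames(sys.stdin.read()))))
```

Fed the trace above, the chain would start at system_call_fastpath and sys_madvise and end in the watchdog and kexec frames on the IRQ stack.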
[Test Case]
- Deploy openstack on openstack
- Run tempest on L1 cloud
- Check kernel log of L1 nova-compute nodes
(Although this may not necessarily be related to nested KVM)
Potentially related: https://lkml.org/lkml/2014/11/14/656
--
Original Description:
When installing qemu-kvm on a VM, KSM is enabled.
I have encountered this problem in trusty:
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
The way to see the behaviour:
1) $ more /sys/kernel/mm/ksm/run
0
2) $ sudo apt-get install qemu-kvm
3) $ more /sys/kernel/mm/ksm/run
1
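The check in steps 1 and 3 can also be scripted, which may help when comparing several compute nodes. The sketch below reads the same sysfs file and maps its value to the meanings given in the kernel's KSM documentation (0: ksmd stopped, merged pages kept; 1: ksmd running; 2: ksmd stopped and all merged pages unmerged). The function name ksm_state and the KSM_RUN_STATES table are hypothetical names chosen for this example.

```python
from pathlib import Path

# Meaning of the values in /sys/kernel/mm/ksm/run, per the kernel KSM docs.
KSM_RUN_STATES = {
    "0": "stopped (merged pages are kept)",
    "1": "running",
    "2": "stopped (all merged pages are unmerged)",
}


def ksm_state(run_file="/sys/kernel/mm/ksm/run"):
    """Return a readable KSM state, or None if the file cannot be read."""
    try:
        raw = Path(run_file).read_text().strip()
    except OSError:
        # No KSM support in this kernel, or no permission to read the file.
        return None
    return KSM_RUN_STATES.get(raw, f"unknown value {raw!r}")


if __name__ == "__main__":
    print(ksm_state())
```

Running it before and after installing qemu-kvm should show the same 0 to 1 transition as the manual steps.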
To see the soft lockups, deploy a cloud on a virtualised environment such as ctsstack and run tempest on it at least twice. The compute nodes of the virtualised deployment will eventually stop responding and log:
[24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
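When checking the kernel log of many compute nodes, it can help to pull these messages out mechanically. The following is a minimal sketch that assumes the message format shown in the lines above; LOCKUP_RE, parse_soft_lockups and summarize are hypothetical names introduced for illustration. The pattern also tolerates a missing leading bracket on the timestamp.

```python
import re
from collections import Counter

# Matches lines such as:
#   [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
LOCKUP_RE = re.compile(
    r"\[?\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup message found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            events.append({
                "timestamp": float(m["ts"]),
                "cpu": int(m["cpu"]),
                "stuck_seconds": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            })
    return events


def summarize(events):
    """Count lockup messages per (command, pid, cpu)."""
    return Counter((e["comm"], e["pid"], e["cpu"]) for e in events)


if __name__ == "__main__":
    import sys

    # Usage (assumed): dmesg | python soft_lockups.py
    for key, count in summarize(parse_soft_lockups(sys.stdin.read())).items():
        print(key, count)
```

A single (command, pid, cpu) key with a steadily growing count, as in the log above, suggests one task stuck on one CPU rather than several independent stalls.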
I am not sure whether the problem is that we are enabling KSM on a VM
or that nested KSM is not behaving properly. Either way, I can easily
reproduce it; please contact me if you need further details.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+subscriptions