[Bug 1413540] Re: soft lockup issues with nested KVM VMs running tempest

 

** Description changed:

- 
  [Impact]
  Users of nested KVM for testing OpenStack hit soft lockups such as the following:
- [74180.076007] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:14590]
- <snip>
- [74180.076007] Call Trace:
- [74180.076007]  [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
- [74180.076007]  [<ffffffff810dbf75>] smp_call_function_single+0xe5/0x190
- [74180.076007]  [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
- [74180.076007]  [<ffffffffa00c4300>] ? rmap_write_protect+0x80/0x80 [kvm]
- [74180.076007]  [<ffffffff810dc3a6>] smp_call_function_many+0x286/0x2d0
- [74180.076007]  [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
- [74180.076007]  [<ffffffff8105c8f7>] native_flush_tlb_others+0x37/0x40
- [74180.076007]  [<ffffffff8105c9cb>] flush_tlb_mm_range+0x5b/0x230
- [74180.076007]  [<ffffffff8105b80d>] pmdp_splitting_flush+0x3d/0x50
- [74180.076007]  [<ffffffff811ac95b>] __split_huge_page+0xdb/0x720
- [74180.076007]  [<ffffffff811ad008>] split_huge_page_to_list+0x68/0xd0
- [74180.076007]  [<ffffffff811ad9a6>] __split_huge_page_pmd+0x136/0x330
- [74180.076007]  [<ffffffff8117728d>] unmap_page_range+0x7dd/0x810
- [74180.076007]  [<ffffffffa00a66b5>] ? kvm_mmu_notifier_invalidate_range_start+0x75/0x90 [kvm]
- [74180.076007]  [<ffffffff81177341>] unmap_single_vma+0x81/0xf0
- [74180.076007]  [<ffffffff811784ed>] zap_page_range+0xed/0x150
- [74180.076007]  [<ffffffff8108ed74>] ? hrtimer_start_range_ns+0x14/0x20
- [74180.076007]  [<ffffffff81174fbf>] SyS_madvise+0x3bf/0x850
- [74180.076007]  [<ffffffff810db841>] ? SyS_futex+0x71/0x150
- [74180.076007]  [<ffffffff8173186d>] system_call_fastpath+0x1a/0x1f
+ 
+ PID: 22262  TASK: ffff8804274bb000  CPU: 1   COMMAND: "qemu-system-x86"
+  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
+  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
+  #2 [ffff88043fd03e30] panic at ffffffff81719ff4
+  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
+  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
+  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
+  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
+  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
+  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
+ --- <IRQ stack> ---
+  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
+     [exception RIP: generic_exec_single+130]
+     RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
+     RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
+     RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
+     RBP: ffff880426f0da30   R8: ffffffff8180ad48   R9: ffff88042713bc68
+     R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
+     R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
+     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
+ #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
+ #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
+ #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
+ #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
+ #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
+ #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
+ #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
+ #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
+ #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
+ #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
+ #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
+ #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
+ #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
+     RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
+     RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
+     RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
+     RBP: 0000000000000000   R8: 0000000000000000   R9: 00007fe7d1cd2738
+     R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
+     R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
+     ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b
+ 
  
  [Test Case]
  - Deploy OpenStack on OpenStack
  - Run tempest on the L1 cloud
  - Check kernel logs on the L1 nova-compute nodes
+ 
+ (This may not be specific to nested KVM.)
+ Potentially related: https://lkml.org/lkml/2014/11/14/656
  
  --
  
  Original Description:
  
  When installing qemu-kvm on a VM, KSM is enabled.
  
  I have encountered this problem on trusty:
  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
  
  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
  0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
  1
  
  To see the soft lockups, deploy a cloud on a virtualised environment such as ctsstack and run tempest on it at least twice; the compute nodes of the virtualised deployment will eventually stop responding with:
  [24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  
  I am not sure whether the problem is that we are enabling KSM on a VM or
  that KSM misbehaves when nested. Either way, I can reproduce this easily;
  please contact me if you need further details.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1413540

Title:
  soft lockup issues with nested KVM VMs running tempest

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]
  Users of nested KVM for testing OpenStack hit soft lockups such as the following:

  PID: 22262  TASK: ffff8804274bb000  CPU: 1   COMMAND: "qemu-system-x86"
   #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
   #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
   #2 [ffff88043fd03e30] panic at ffffffff81719ff4
   #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
   #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
   #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
   #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
   #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
   #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
  --- <IRQ stack> ---
   #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
      [exception RIP: generic_exec_single+130]
      RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
      RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
      RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
      RBP: ffff880426f0da30   R8: ffffffff8180ad48   R9: ffff88042713bc68
      R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
      R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
      ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
  #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
  #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
  #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
  #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
  #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
  #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
  #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
  #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
  #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
  #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
  #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
  #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
  #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
      RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
      RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
      RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
      RBP: 0000000000000000   R8: 0000000000000000   R9: 00007fe7d1cd2738
      R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
      R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b
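
  (A decoding note, assuming the standard x86_64 syscall convention where
  RDI, RSI and RDX hold the first three arguments: ORIG_RAX 0x1c is
  madvise(2), so the user-space registers above correspond to
  madvise(0x7fe7be1ff000, 0x7fb000, 4), where advice 4 is MADV_DONTNEED,
  i.e. qemu discarding a range of guest memory. Frames #10-#17 show that
  zapping the range splits a transparent huge page, which sends an
  IPI-based remote TLB flush; the soft lockup is consistent with this CPU
  spinning in generic_exec_single while the other, nested vCPUs are not
  scheduled in time to acknowledge the IPI.)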

  
  [Test Case]
  - Deploy OpenStack on OpenStack
  - Run tempest on the L1 cloud
  - Check kernel logs on the L1 nova-compute nodes (see the sketch below)

  (This may not be specific to nested KVM.)
  Potentially related: https://lkml.org/lkml/2014/11/14/656
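
  A minimal sketch of the log check in the last step, assuming the L1
  compute nodes are reachable over ssh (the hostnames are illustrative):

  $ for h in nova-compute-0 nova-compute-1; do
  >     echo "== $h =="
  >     ssh ubuntu@$h "dmesg | grep 'BUG: soft lockup'"
  > done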

  --

  Original Description:

  When installing qemu-kvm on a VM, KSM is enabled.

  I have encountered this problem on trusty:
  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
  0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
  1
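
  A hedged workaround sketch (stopping KSM; not confirmed to resolve the
  lockups): writing 2 to the run file stops ksmd and unmerges all
  currently merged pages, and on trusty the qemu-kvm package reads
  /etc/default/qemu-kvm, so setting KSM_ENABLED=0 there should keep KSM
  off across service restarts:

  $ echo 2 | sudo tee /sys/kernel/mm/ksm/run
  2
  $ sudo sed -i 's/^KSM_ENABLED=1/KSM_ENABLED=0/' /etc/default/qemu-kvm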

  To see the soft lockups, deploy a cloud on a virtualised environment such as ctsstack and run tempest on it at least twice; the compute nodes of the virtualised deployment will eventually stop responding with:
  [24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]

  I am not sure whether the problem is that we are enabling KSM on a VM
  or that KSM misbehaves when nested. Either way, I can reproduce this
  easily; please contact me if you need further details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+subscriptions