kernel-packages team mailing list archive
-
kernel-packages team
-
Mailing list archive
-
Message #81296
[Bug 1372975] [NEW] ISST-KVM : Ubuntu 14.04 guest actg1 running TCP and IO hanging
You have been subscribed to a public bug:
Problem Description:
--------------------------
PowerKVM guest actg1 is hanging while running IO and TCP tests. Unable to ssh or virsh console to the guest.
While running IO and TCP tests the guest actg1 is giving below CPU stall
trace :
[16091.824841] INFO: rcu_sched self-detected stall on CPU { 1} (t=670218 jiffies g=122793 c=122792 q=6344067)
[16154.853532] INFO: rcu_sched self-detected stall on CPU { 1} (t=676521 jiffies g=122793 c=122792 q=6345885)
[16217.882223] INFO: rcu_sched self-detected stall on CPU { 1} (t=682824 jiffies g=122793 c=122792 q=6347609)
[16280.910913] INFO: rcu_sched self-detected stall on CPU { 1} (t=689127 jiffies g=122793 c=122792 q=6349994)
[16343.939603] INFO: rcu_sched self-detected stall on CPU { 1} (t=695430 jiffies g=122793 c=122792 q=6351789)
[16406.968293] INFO: rcu_sched self-detected stall on CPU { 1} (t=701733 jiffies g=122793 c=122792 q=6353519)
[16469.996983] INFO: rcu_sched self-detected stall on CPU { 1} (t=708036 jiffies g=122793 c=122792 q=6355275)
[16533.025674] INFO: rcu_sched self-detected stall on CPU { 1} (t=714339 jiffies g=122793 c=122792 q=6357064)
[16596.054364] INFO: rcu_sched self-detected stall on CPU { 1} (t=720642 jiffies g=122793 c=122792 q=6358808)
[16659.083056] INFO: rcu_sched self-detected stall on CPU { 1} (t=726945 jiffies g=122793 c=122792 q=6360538)
[16722.111749] INFO: rcu_sched self-detected stall on CPU { 1} (t=733248 jiffies g=122793 c=122792 q=6362273)
XML file :
---------
<domain type='kvm' id='2'>
<name>actg1</name>
<uuid>01a4e83f-2275-43b2-a4b3-31ada581141d</uuid>
<memory unit='KiB'>4194304</memory>
<currentMemory unit='KiB'>4194304</currentMemory>
<vcpu placement='static'>2</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='ppc64' machine='pseries'>hvm</type>
<boot dev='hd'/>
<boot dev='cdrom'/>
<boot dev='network'/>
<bootmenu enable='yes'/>
</os>
<features>
<acpi/>
<apic/>
</features>
<cpu mode='custom' match='exact'>
<model fallback='allow'>power8</model>
<topology sockets='2' cores='2' threads='2'/>
</cpu>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/bin/qemu-system-ppc64</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none'/>
<source dev='/dev/disk/by-id/wwn-0x6005076802810c1fa0000000000000a1'/> =======================> SAN (coho)
<target dev='hda' bus='scsi'/>
<alias name='scsi0-0-0-0'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none'/>
<source dev='/dev/disk/by-id/scsi-1IBM_IPR-0_5D81B900000010C0'/> ============================> LocalCEC
<target dev='hdb' bus='scsi'/>
<alias name='scsi0-0-0-1'/>
<address type='drive' controller='0' bus='0' target='0' unit='1'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw'/>
<source dev='/dev/disk/by-id/scsi-1IBM_IPR-0_5D81B900000010A0'/> ============================> LocalCEC
<target dev='hdc' bus='scsi'/>
<alias name='scsi0-0-0-2'/>
<address type='drive' controller='0' bus='0' target='0' unit='2'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none'/>
<source dev='/dev/disk/by-id/wwn-0x6005076802810c1fa0000000000000cd'/> ============================> iscsi
<target dev='hdd' bus='scsi'/>
<alias name='scsi0-0-0-3'/>
<address type='drive' controller='0' bus='0' target='0' unit='3'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw'/>
<source dev='/dev/disk/by-id/wwn-0x6005076802810c1fa0000000000000cc'/> ============================> iscsi
<target dev='hde' bus='scsi'/>
<alias name='scsi0-0-0-4'/>
<address type='drive' controller='0' bus='0' target='0' unit='4'/>
</disk>
<controller type='scsi' index='0'>
<alias name='scsi0'/>
<address type='spapr-vio' reg='0x2000'/>
</controller>
<controller type='scsi' index='1' model='virtio-scsi'>
<alias name='scsi1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
<controller type='usb' index='0' model='nec-xhci'>
<alias name='usb0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<alias name='pci.0'/>
</controller>
<interface type='bridge'>
<mac address='52:54:00:37:24:fe'/>
<source bridge='brenP6p1s0f1'/>
<target dev='vnet0'/>
<model type='spapr-vlan'/>
<alias name='net0'/>
<address type='spapr-vio' reg='0x1000'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/1'/>
<target port='0'/>
<alias name='serial0'/>
<address type='spapr-vio' reg='0x30000000'/>
</serial>
<console type='pty' tty='/dev/pts/1'>
<source path='/dev/pts/1'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
<address type='spapr-vio' reg='0x30000000'/>
</console>
<video>
<model type='vga' vram='9216' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
</video>
</devices>
<seclabel type='dynamic' model='selinux' relabel='yes'>
<label>system_u:system_r:svirt_t:s0:c405,c741</label>
<imagelabel>system_u:object_r:svirt_image_t:s0:c405,c741</imagelabel>
</seclabel>
</domain>
Host configuration:
-------------------
[root@actkvm home]# cat /etc/issue
IBM_PowerKVM release 2.1.0 build 27 service (pkvm2_1)
Kernel \r on a \m (\l)
[root@actkvm home]# virsh version
Compiled against library: libvirt 1.1.3
Using library: libvirt 1.1.3
Using API: QEMU 1.1.3
Running hypervisor: QEMU 1.6.0
Dropped into xmon, much of the same, soft lockups between raw_spin_lock
and shrink_dentry_list.
The guest manages to get a fair bit into its syslog before locking up, I
think the two most interesting bits are
Sep 18 05:44:46 actg1 kernel: [ 8349.373013] ------------[ cut here ]------------
Sep 18 05:44:46 actg1 kernel: [ 8349.373019] WARNING: at /build/buildd/linux-3.13.0/fs/dcache.c:362
Sep 18 05:44:46 actg1 kernel: [ 8349.373020] Modules linked in: rpcsec_gss_krb5 nfsv4 dm_round_robin dm_multipath scsi_dh pseries_rng nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache rtc_generic virtio_scsi
Sep 18 05:44:46 actg1 kernel: [ 8349.373036] CPU: 1 PID: 22289 Comm: host01 Not tainted 3.13.0-35-generic #62-Ubuntu
Sep 18 05:44:46 actg1 kernel: [ 8349.373038] task: c0000000fcb40000 ti: c0000000faec0000 task.ti: c0000000faec0000
Sep 18 05:44:46 actg1 kernel: [ 8349.373040] NIP: c000000000274980 LR: c0000000002757f8 CTR: c0000000002e2f90
Sep 18 05:44:46 actg1 kernel: [ 8349.373042] REGS: c0000000faec3590 TRAP: 0700 Not tainted (3.13.0-35-generic)
Sep 18 05:44:46 actg1 kernel: [ 8349.373042] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 22222848 XER: 20000000
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] CFAR: c000000000274900 SOFTE: 1
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR00: c0000000002757f8 c0000000faec3810 c00000000164f6a0 c0000000f9b1c840
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR04: c0000000faec39b8 0000000001af2460 0000000000000000 0000000000000000
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR08: c000000001708ff0 0000000000000001 c00000000150f6a0 00000000000000b4
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR12: 0000000022222842 c00000000fe40380 0000010016693908 00000100166939c8
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR16: 000000001016f5a8 000000000000000a 0000000000000000 0000000000200000
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR20: 0000000000000000 c000000000c222a8 0000000000000004 c0000000fbc04280
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR24: 0000000000001143 c0000000f9b1ed00 0000000000000000 0000000000000000
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR28: c0000000f9b1c840 c0000000f9b1c840 c0000000faec39b8 c0000000f9b1c840
Sep 18 05:44:46 actg1 kernel: [ 8349.373070] NIP [c000000000274980] .d_shrink_del+0xb0/0xd0
Sep 18 05:44:46 actg1 kernel: [ 8349.373072] LR [c0000000002757f8] .shrink_dentry_list+0xc8/0x160
Sep 18 05:44:46 actg1 kernel: [ 8349.373073] Call Trace:
Sep 18 05:44:46 actg1 kernel: [ 8349.373077] [c0000000faec3810] [c00000000095c140] ._raw_spin_lock+0xb0/0xc0 (unreliable)
Sep 18 05:44:46 actg1 kernel: [ 8349.373080] [c0000000faec3890] [c0000000002757f8] .shrink_dentry_list+0xc8/0x160
Sep 18 05:44:46 actg1 kernel: [ 8349.373081] [c0000000faec3940] [c000000000275a54] .shrink_dcache_parent+0x44/0xa0
Sep 18 05:44:46 actg1 kernel: [ 8349.373085] [c0000000faec39f0] [c0000000002ea3fc] .proc_flush_task+0xcc/0x200
Sep 18 05:44:46 actg1 kernel: [ 8349.373088] [c0000000faec3ae0] [c00000000009e16c] .release_task+0x9c/0x540
Sep 18 05:44:46 actg1 kernel: [ 8349.373090] [c0000000faec3bc0] [c00000000009ebe8] .wait_task_zombie+0x5d8/0x760
Sep 18 05:44:46 actg1 kernel: [ 8349.373092] [c0000000faec3c80] [c00000000009f6c0] .do_wait+0x120/0x2c0
Sep 18 05:44:46 actg1 kernel: [ 8349.373095] [c0000000faec3d40] [c0000000000a0d40] .SyS_wait4+0xa0/0x140
Sep 18 05:44:46 actg1 kernel: [ 8349.373097] [c0000000faec3e30] [c00000000000a158] syscall_exit+0x0/0x98
Sep 18 05:44:46 actg1 kernel: [ 8349.373098] Instruction dump:
Sep 18 05:44:46 actg1 kernel: [ 8349.373099] 3908ffff 7d09512a 4bd9c529 60000000 38210080 e8010010 ebe1fff8 7c0803a6
Sep 18 05:44:46 actg1 kernel: [ 8349.373102] 4e800020 3d42ffec 892ae58d 69290001 <0b090000> 2fa90000 41feff7c 39200001
and then
Sep 18 05:47:32 actg1 kernel: [ 8514.856141] JIT Sampler invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 18 05:47:32 actg1 kernel: [ 8514.856150] JIT Sampler cpuset=/ mems_allowed=0
Sep 18 05:47:32 actg1 kernel: [ 8514.856156] CPU: 0 PID: 3467 Comm: JIT Sampler Tainted: G W 3.13.0-35-generic #62-Ubuntu
Sep 18 05:47:32 actg1 kernel: [ 8514.856159] Call Trace:
Sep 18 05:47:32 actg1 kernel: [ 8514.856170] [c0000000f58a2cf0] [c000000000016b00] .show_stack+0x170/0x290 (unreliable)
Sep 18 05:47:32 actg1 kernel: [ 8514.856179] [c0000000f58a2de0] [c000000000970b5c] .dump_stack+0x88/0xb4
Sep 18 05:47:32 actg1 kernel: [ 8514.856183] [c0000000f58a2e60] [c00000000096a43c] .dump_header+0xc0/0x248
Sep 18 05:47:32 actg1 kernel: [ 8514.856187] [c0000000f58a2f40] [c0000000001c6bf8] .oom_kill_process+0x318/0x4c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856191] [c0000000f58a3020] [c0000000001c7598] .out_of_memory+0x4f8/0x530
Sep 18 05:47:32 actg1 kernel: [ 8514.856194] [c0000000f58a3100] [c0000000001cea60] .__alloc_pages_nodemask+0xbb0/0xc70
Sep 18 05:47:32 actg1 kernel: [ 8514.856199] [c0000000f58a32f0] [c000000000224958] .alloc_pages_current+0xb8/0x1a0
Sep 18 05:47:32 actg1 kernel: [ 8514.856203] [c0000000f58a3390] [c0000000001c2efc] .__page_cache_alloc+0xcc/0x110
Sep 18 05:47:32 actg1 kernel: [ 8514.856206] [c0000000f58a3420] [c0000000001c4884] .filemap_fault+0x204/0x4b0
Sep 18 05:47:32 actg1 kernel: [ 8514.856210] [c0000000f58a3500] [c0000000001f6d30] .__do_fault+0xc0/0x6f0
Sep 18 05:47:32 actg1 kernel: [ 8514.856214] [c0000000f58a35f0] [c0000000001fc004] .handle_mm_fault+0x214/0xde0
Sep 18 05:47:32 actg1 kernel: [ 8514.856218] [c0000000f58a36e0] [c00000000095e8f8] .do_page_fault+0x388/0x8c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856221] [c0000000f58a3910] [c000000000009568] handle_page_fault+0x10/0x30
Sep 18 05:47:32 actg1 kernel: [ 8514.856227] --- Exception: 301 at .do_page_fault+0x468/0x8c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856227] LR = .do_page_fault+0x348/0x8c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856231] [c0000000f58a3e30] [c000000000009568] handle_page_fault+0x10/0x30
Sep 18 05:47:32 actg1 kernel: [ 8514.856233] Mem-Info:
Sep 18 05:47:32 actg1 kernel: [ 8514.856236] Node 0 DMA per-cpu:
Sep 18 05:47:32 actg1 kernel: [ 8514.856239] CPU 0: hi: 6, btch: 1 usd: 0
Sep 18 05:47:32 actg1 kernel: [ 8514.856241] CPU 1: hi: 6, btch: 1 usd: 0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] active_anon:274 inactive_anon:317 isolated_anon:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] active_file:24 inactive_file:58 isolated_file:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] unevictable:283 dirty:0 writeback:317 unstable:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] free:2001 slab_reclaimable:8289 slab_unreclaimable:53322
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] mapped:159 shmem:0 pagetables:92 bounce:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] free_cma:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856253] Node 0 DMA free:128064kB min:131072kB low:163840kB high:196608kB active_anon:17536kB inactive_anon:20288kB active_file:1536kB inactive_file:3712kB unevictable:18112kB isolated(anon):0kB isola
ted(file):0kB present:4194304kB managed:4171392kB mlocked:18112kB dirty:0kB writeback:20288kB mapped:10176kB shmem:0kB slab_reclaimable:530496kB slab_unreclaimable:3412608kB kernel_stack:1014816kB pagetables:5888kB unstable:0kB bounce:0
kB free_cma:0kB writeback_tmp:0kB pages_scanned:4216 all_unreclaimable? yes
Sep 18 05:47:32 actg1 kernel: [ 8514.856263] lowmem_reserve[]: 0 0 0
Sep 18 05:47:32 actg1 kernel: [ 8514.856267] Node 0 DMA: 771*64kB (EM) 460*128kB (M) 0*256kB 1*512kB (R) 0*1024kB 1*2048kB (R) 2*4096kB (R) 2*8192kB (R) 0*16384kB = 135360kB
Sep 18 05:47:32 actg1 kernel: [ 8514.856280] 626 total pagecache pages
Sep 18 05:47:32 actg1 kernel: [ 8514.856283] 446 pages in swap cache
Sep 18 05:47:32 actg1 kernel: [ 8514.856285] Swap cache stats: add 21120, delete 20674, find 8129/11084
Sep 18 05:47:32 actg1 kernel: [ 8514.856287] Free swap = 727424kB
Sep 18 05:47:32 actg1 kernel: [ 8514.856288] Total swap = 917440kB
Sep 18 05:47:32 actg1 kernel: [ 8514.856290] 65536 pages RAM
Sep 18 05:47:32 actg1 kernel: [ 8514.856291] 0 pages HighMem/MovableOnly
Sep 18 05:47:32 actg1 kernel: [ 8514.856293] 358 pages reserved
...etc/etc
For what it's worth, my guest which reproduces this problem under 14.04.1 (but does not reproduce it under 14.10)
There are a number of patches affecting dcache code between 14.04.1 and
14.10 - in particular:
commit 046b961b45f93a92e4c70525a12f3d378bced130
Author: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date: Thu May 29 08:54:52 2014 -0400
shrink_dentry_list(): take parent's ->d_lock earlier
The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one. d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.
...
...
3.13.0-35.62 which 14.04.1 is running does not include this patch, but a tag 3.13.0-36.63 exists in the trusty kernel tree which does, probably a future update.
I've built a kernel from this tag and installed it in the guest - please retest.
Hi Samuel,
The run completed 26 hours successfully. Seems the issue is solved.
As discussed , please let me know when .deb is available for this patch
** Affects: linux (Ubuntu)
Importance: Undecided
Status: Confirmed
** Tags: architecture-ppc64 bugnameltc-115314 severity-critical targetmilestone-inin1404
--
ISST-KVM : Ubuntu 14.04 guest actg1 running TCP and IO hanging
https://bugs.launchpad.net/bugs/1372975
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.