group.of.nepali.translators team mailing list archive
Message #13280
[Bug 1659111] Re: UbuntuKVM guest crashed while running I/O stress test with Ubuntu kernel 4.4.0-47-generic
This bug was fixed in the package linux - 4.4.0-78.99
---------------
linux (4.4.0-78.99) xenial; urgency=low
* linux: 4.4.0-78.99 -proposed tracker (LP: #1686645)
* Please backport fix to reference leak in cgroup blkio throttle
(LP: #1683976)
- block: fix module reference leak on put_disk() call for cgroups throttle
* UbuntuKVM guest crashed while running I/O stress test with Ubuntu kernel
4.4.0-47-generic (LP: #1659111)
- block: Unhash block device inodes on gendisk destruction
- block: Use pointer to backing_dev_info from request_queue
- block: Dynamically allocate and refcount backing_dev_info
- block: Make blk_get_backing_dev_info() safe without open bdev
- block: Get rid of blk_get_backing_dev_info()
- block: Move bdev_unhash_inode() after invalidate_partition()
- block: Unhash also block device inode for the whole device
- block: Revalidate i_bdev reference in bd_aquire()
- block: Initialize bd_bdi on inode initialization
- block: Move bdi_unregister() to del_gendisk()
- block: Allow bdi re-registration
- bdi: Fix use-after-free in wb_congested_put()
- block: Make del_gendisk() safer for disks without queues
- block: Fix bdi assignment to bdev inode when racing with disk delete
- bdi: Mark congested->bdi as internal
- bdi: Make wb->bdi a proper reference
- bdi: Unify bdi->wb_list handling for root wb_writeback
- bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
- bdi: Do not wait for cgwbs release in bdi_unregister()
- bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
- block: Fix oops in locked_inode_to_wb_and_lock_list()
- kobject: Export kobject_get_unless_zero()
- block: Fix oops scsi_disk_get()
* Touchpad not working correctly after kernel upgrade (LP: #1662589)
- Input: ALPS - fix V8+ protocol handling (73 03 28)
* Xenial update to v4.4.62 stable release (LP: #1683728)
- drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
- drm/i915: Stop using RP_DOWN_EI on Baytrail
- usb: dwc3: gadget: delay unmap of bounced requests
- mtd: bcm47xxpart: fix parsing first block after aligned TRX
- MIPS: Introduce irq_stack
- MIPS: Stack unwinding while on IRQ stack
- MIPS: Only change $28 to thread_info if coming from user mode
- MIPS: Switch to the irq_stack in interrupts
- MIPS: Select HAVE_IRQ_EXIT_ON_IRQ_STACK
- MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch
- crypto: caam - fix RNG deinstantiation error checking
- Linux 4.4.62
* ifup service of network device stay active after driver stop (LP: #1672144)
- net: use net->count to check whether a netns is alive or not
* [Hyper-V] mkfs regression in kernel 4.4+ (LP: #1682215)
- block: relax check on sg gap
* [Feature] KBL: intel_powerclamp driver support (LP: #1591641)
- thermal/powerclamp: remove cpu whitelist
- thermal/powerclamp: correct cpu support check
- thermal/powerclamp: add back module device table
* sysfs channel reads of lps22hb pressure sensor are stale (LP: #1682103)
- iio: st_pressure: initialize lps22hb bootime
* Backlight control does not work and there are no entries in
/sys/class/backlight (LP: #1667323)
- Revert "ACPI / video: Add force_native quirk for HP Pavilion dv6"
* [Feature] KBL: intel_rapl driver support (LP: #1591640)
- powercap/intel_rapl: Add support for Kabylake
* Xenial update to v4.4.61 stable release (LP: #1682140)
- drm/vmwgfx: Type-check lookups of fence objects
- drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
- drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
- drm/ttm, drm/vmwgfx: Relax permission checking when opening surfaces
- drm/vmwgfx: Remove getparam error message
- drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
- sysfs: be careful of error returns from ops->show()
- staging: android: ashmem: lseek failed due to no FMODE_LSEEK.
- arm/arm64: KVM: Take mmap_sem in stage2_unmap_vm
- arm/arm64: KVM: Take mmap_sem in kvm_arch_prepare_memory_region
- iio: bmg160: reset chip when probing
- Reset TreeId to zero on SMB2 TREE_CONNECT
- ptrace: fix PTRACE_LISTEN race corrupting task->state
- ring-buffer: Fix return value check in test_ringbuffer()
- metag/usercopy: Drop unused macros
- metag/usercopy: Fix alignment error checking
- metag/usercopy: Add early abort to copy_to_user
- metag/usercopy: Zero rest of buffer from copy_from_user
- metag/usercopy: Set flags before ADDZ
- metag/usercopy: Fix src fixup in from user rapf loops
- metag/usercopy: Add missing fixups
- powerpc/mm: Add missing global TLB invalidate if cxl is active
- powerpc: Don't try to fix up misaligned load-with-reservation instructions
- nios2: reserve boot memory for device tree
- s390/decompressor: fix initrd corruption caused by bss clear
- s390/uaccess: get_user() should zero on failure (again)
- MIPS: Force o32 fp64 support on 32bit MIPS64r6 kernels
- MIPS: ralink: Fix typos in rt3883 pinctrl
- MIPS: End spinlocks with .insn
- MIPS: Lantiq: fix missing xbar kernel panic
- MIPS: Flush wrong invalid FTLB entry for huge page
- mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
- Linux 4.4.61
* Xenial update to v4.4.60 stable release (LP: #1681862)
- libceph: force GFP_NOIO for socket allocations
- xen/setup: Don't relocate p2m over existing one
- scsi: mpt3sas: fix hang on ata passthrough commands
- scsi: sg: check length passed to SG_NEXT_CMD_LEN
- scsi: libsas: fix ata xfer length
- ALSA: seq: Fix race during FIFO resize
- ALSA: hda - fix a problem for lineout on a Dell AIO machine
- ASoC: atmel-classd: fix audio clock rate
- ACPI: Fix incompatibility with mcount-based function graph tracing
- ACPI: Do not create a platform_device for IOAPIC/IOxAPIC
- tty/serial: atmel: fix race condition (TX+DMA)
- tty/serial: atmel: fix TX path in atmel_console_write()
- USB: fix linked-list corruption in rh_call_control()
- KVM: x86: clear bus pointer when destroyed
- drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
- mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
- MIPS: Lantiq: Fix cascaded IRQ setup
- rtc: s35390a: fix reading out alarm
- rtc: s35390a: make sure all members in the output are set
- rtc: s35390a: implement reset routine as suggested by the reference
- rtc: s35390a: improve irq handling
- KVM: kvm_io_bus_unregister_dev() should never fail
- power: reset: at91-poweroff: timely shutdown LPDDR memories
- blk: improve order of bio handling in generic_make_request()
- blk: Ensure users for current->bio_list can see the full list.
- padata: avoid race in reordering
- Linux 4.4.60
-- Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx> Thu, 27 Apr
2017 10:24:08 -0300
** Changed in: linux (Ubuntu Xenial)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1659111
Title:
UbuntuKVM guest crashed while running I/O stress test with Ubuntu
kernel 4.4.0-47-generic
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Xenial:
Fix Released
Status in linux source package in Yakkety:
In Progress
Status in linux source package in Zesty:
In Progress
Bug description:
Attn. Canonical: For your awareness only at this time.
== Comment: #0 - LEKSHMI C. PILLAI - 2016-11-22 03:49:38 ==
Machine INFO
KVM HOST: luckyv1
Guest :lucky05
lucky05 crashed while running the I/O stress test for SAN disks.
Installed lucky05 and enabled xmon on it. After that, started the
RAW disk test on around 50 disks. After 6-7 hours of running, the
machine dropped into xmon.
Logs:
[25023.224182] Unable to handle kernel paging request for data at address 0x00000000
[25023.224257] Faulting instruction address: 0xc000000000324c60
cpu 0x3: Vector: 300 (Data Access) at [c0000000fffc3620]
pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590
sp: c0000000fffc38a0
msr: 8000000100009033
dar: 0
dsisr: 40000000
current = 0xc0000000ff99e470
paca = 0xc00000000fb41c80 softe: 0 irq_happened: 0x01
pid = 14736, comm = kworker/u16:8
enter ? for help
[c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
[c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
[c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
[c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
[c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
[c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
[c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
[c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
3:mon> f
3:mon> th
[c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
[c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
[c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
[c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
[c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
[c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
[c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
[c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
3:mon> sh
[27384.651055] INFO: rcu_sched detected stalls on CPUs/tasks:
[27384.651220] (detected by 4, t=40598 jiffies, g=2849830, c=2849829, q=992)
[27384.651286] All QSes seen, last rcu_sched kthread activity 40596 (4301188714-4301148118), jiffies_till_next_fqs=1, root ->qsmask 0x0
[27384.651501] rcu_sched kthread starved for 40596 jiffies! g2849830 c2849829 f0x2 s3 ->state=0x0
[27384.651747] INFO: rcu_sched detected stalls on CPUs/tasks:
[27384.651905] (detected by 4, t=590354 jiffies, g=2849830, c=2849829, q=1285)
[27384.652012] All QSes seen, last rcu_sched kthread activity 590352 (4301738470-4301148118), jiffies_till_next_fqs=1, root ->qsmask 0x0
[27384.652191] rcu_sched kthread starved for 590352 jiffies! g2849830 c2849829 f0x2 s3 ->state=0x0
[27384.730645] Unable to handle kernel paging request for data at address 0xffffffffffffffd8
[27384.730781] Faulting instruction address: 0xc0000000000e7258
cpu 0x3: Vector: 300 (Data Access) at [c0000000fffc3000]
pc: c0000000000e7258: kthread_data+0x28/0x40
lr: c0000000000de940: wq_worker_sleeping+0x30/0x110
sp: c0000000fffc3280
msr: 8000000100009033
dar: ffffffffffffffd8
dsisr: 40000000
current = 0xc0000000ff99e470
paca = 0xc00000000fb41c80 softe: 0 irq_happened: 0x01
pid = 14736, comm = kworker/u16:8
enter ? for help
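The two rcu_sched stall reports in the log above each print the time since the last rcu_sched kthread activity, followed by the two jiffies values it was computed from. As a minimal sketch, the arithmetic can be re-checked as below. The tick rate ASSUMED_HZ is an assumption (it is not stated anywhere in this report), so the conversion to seconds is only indicative.

```python
import re

# Lines copied from the rcu_sched stall reports in the log above.
STALL_LINES = [
    "All QSes seen, last rcu_sched kthread activity 40596 "
    "(4301188714-4301148118), jiffies_till_next_fqs=1, root ->qsmask 0x0",
    "All QSes seen, last rcu_sched kthread activity 590352 "
    "(4301738470-4301148118), jiffies_till_next_fqs=1, root ->qsmask 0x0",
]

# ASSUMPTION: kernel tick rate; the report does not state CONFIG_HZ.
ASSUMED_HZ = 250

PATTERN = re.compile(r"activity (\d+) \((\d+)-(\d+)\)")


def stall_jiffies(line):
    """Return (reported, recomputed) jiffies since the last kthread activity."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an rcu_sched kthread activity line")
    reported, now, last_activity = (int(group) for group in match.groups())
    return reported, now - last_activity


results = [stall_jiffies(line) for line in STALL_LINES]
for reported, recomputed in results:
    seconds = recomputed / ASSUMED_HZ
    print(f"reported={reported} recomputed={recomputed} "
          f"(~{seconds:.0f}s at an assumed HZ={ASSUMED_HZ})")
```

Both reports refer to the same last-activity jiffies value (4301148118), so the rcu_sched kthread made no progress between the two messages.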
== Comment: #1 - LEKSHMI C. PILLAI - 2016-11-22 04:05:41 ==
3:mon> th
[c0000000fffc32b0] c0000000000de940 wq_worker_sleeping+0x30/0x110
[c0000000fffc32f0] c000000000af31bc __schedule+0x6ec/0x990
[c0000000fffc33c0] c000000000af34a8 schedule+0x48/0xc0
[c0000000fffc33f0] c0000000000bd3d0 do_exit+0x760/0xc30
[c0000000fffc34b0] c000000000020bf4 die+0x314/0x470
[c0000000fffc3540] c000000000050d98 bad_page_fault+0xd8/0x150
[c0000000fffc35b0] c000000000008680 handle_page_fault+0x2c/0x30
--- Exception: 300 (Data Access) at c000000000324c60 locked_inode_to_wb_and_lock_list+0x50/0x290
[c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
[c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
[c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
[c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
[c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
[c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
[c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
[c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
3:mon>
== Comment: #6 - Laurent Dufour - 2016-11-23 03:00:16 ==
Logged in to luckyv1 and found a lot of ipr issues on this node:
[525973.896624] qla2xxx 0005:09:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[525973.956619] qla2xxx 0005:09:00.1: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[529433.834853] ipr 0001:04:00.0: FFFE: Soft device bus error recovered by the IOA
[529433.834867] ipr: -----Failing Device Information-----
[529433.834870] ipr: World Wide Unique ID: 500507605EC10C000000000000000000
[529433.834873] ipr: Device Resource Path: FF
[529433.834875] ipr: Primary Problem Description: Command Timeout
[529433.834878] ipr: Secondary Problem Description: Command timeout expired
[529433.834880] ipr: SCSI Sense Data:
[529433.834882] ipr: 00000000: 00000000 00000000 00000000 00000000
[529433.834884] ipr: 00000010: 00000000 00000000 00000000 00000000
[529433.834886] ipr: SCSI Command Descriptor Block:
[529433.834889] ipr: 00000000: 9E120004 0F000000 00000000 0020AD00
[529433.834891] ipr: Additional IOA Data:
[529433.834893] ipr: 00000000: 4646001C 44010007 00050000 04700002
[529433.834895] ipr: 00000010: 3B894A49 1EE620CC 04700002 49574631
[529433.834897] ipr: 00000020: 455300CC 06B00027 00000020 84000000
[529433.834899] ipr: 00000030: 00000000 05801000 0B29A7C0 00000000
[529433.834901] ipr: 00000040: 00000000 00000000 00000000 00000000
[529433.834904] ipr: 00000050: 00000000 00000000 00000000 00000000
[529433.834906] ipr: 00000060: 00000000 00000000 00000000 00000000
[529433.834908] ipr: 00000070: 00000000 00000000 00000000 00000000
[529433.834910] ipr: 00000080: 00000000 00000000 00000000 00000000
[529433.834912] ipr: 00000090: 00000000 00000000 00000000 00000000
[529433.834914] ipr: 000000A0: 00000000 D4000018 80000000 FFFFFFFF
[529433.834917] ipr: 000000B0: FFFFFFFF 00000000 0980EC21 00000000
[529433.834919] ipr: 000000C0: 00000000 00000000 01769A24 00000000
[529433.834921] ipr: 000000D0: 01D3C300 E0050000 FFFFFFFE 0B5A0000
[529433.834923] ipr: 000000E0: 00000000 9E120004 0F000000 00000000
[529433.834926] ipr: 000000F0: 43440010 9E120004 0F000000 00000000
[529433.834928] ipr: 00000100: 0020AD00 45480010 0100E038 9E12FFFF
[529433.834930] ipr: 00000110: 01080002 00000000 45540004 00001463
In addition, some NFS issues are reported:
[563034.817901] nfs: server 10.33.11.31 not responding, timed out
[563405.504308] nfs: server 10.33.11.31 not responding, timed out
That said, chig5 entered xmon due to a bad pointer in the kernel:
3:mon> e
cpu 0x3: Vector: 300 (Data Access) at [c0000000fffc3000]
pc: c0000000000e7258: kthread_data+0x28/0x40
lr: c0000000000de940: wq_worker_sleeping+0x30/0x110
sp: c0000000fffc3280
msr: 8000000100009033
dar: ffffffffffffffd8
dsisr: 40000000
current = 0xc0000000ff99e470
paca = 0xc00000000fb41c80 softe: 0 irq_happened: 0x01
pid = 14736, comm = kworker/u16:8
3:mon> th
[c0000000fffc32b0] c0000000000de940 wq_worker_sleeping+0x30/0x110
[c0000000fffc32f0] c000000000af31bc __schedule+0x6ec/0x990
[c0000000fffc33c0] c000000000af34a8 schedule+0x48/0xc0
[c0000000fffc33f0] c0000000000bd3d0 do_exit+0x760/0xc30
[c0000000fffc34b0] c000000000020bf4 die+0x314/0x470
[c0000000fffc3540] c000000000050d98 bad_page_fault+0xd8/0x150
[c0000000fffc35b0] c000000000008680 handle_page_fault+0x2c/0x30
--- Exception: 300 (Data Access) at c000000000324c60 locked_inode_to_wb_and_lock_list+0x50/0x290
[c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
[c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
[c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
[c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
[c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
[c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
[c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
[c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
Looking at the other guests, as Lekshmi mentioned that all the guests
are crashing.
== Comment: #7 - Laurent Dufour - 2016-11-23 03:24:34 ==
The guest lucky01 (4.4.0-47-generic) is fine :
root@lucky01:/Blast# date
Wed Nov 23 03:04:23 CST 2016
The guest lucky02 (4.4.0-47-generic) has entered xmon due to the same issue as lucky05:
7:mon> e
cpu 0x7: Vector: 300 (Data Access) at [c0000001f265b620]
pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590
sp: c0000001f265b8a0
msr: 8000000100009033
dar: 0
dsisr: 40000000
current = 0xc0000001f222fcc0
paca = 0xc00000000fb44280 softe: 0 irq_happened: 0x01
pid = 12062, comm = kworker/u16:3
7:mon> t
[c0000001f265b900] c00000000032831c writeback_sb_inodes+0x30c/0x590
[c0000001f265ba10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
[c0000001f265ba70] c000000000328aec wb_writeback+0x30c/0x450
[c0000001f265bb40] c0000000003296b4 wb_workfn+0x264/0x570
[c0000001f265bc50] c0000000000dd930 process_one_work+0x1e0/0x5a0
[c0000001f265bce0] c0000000000dde84 worker_thread+0x194/0x680
[c0000001f265bd80] c0000000000e6980 kthread+0x110/0x130
[c0000001f265be30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
--- Exception: 0 at 0000000000000000
The guest lucky03 didn't enter xmon but is not responding any more. Unfortunately sysrq is not enabled on this guest. There is still some activity on this guest.
root@luckyv1:~# virsh qemu-monitor-command --hmp lucky03 'info cpus'
* CPU #0: nip=0xc0000000001035e0 thread_id=76434
CPU #1: nip=0xc0000000000863dc thread_id=76435
CPU #2: nip=0xc0000000000863dc thread_id=76436
CPU #3: nip=0xc0000000000863dc thread_id=76437
CPU #4: nip=0xc0000000000863dc thread_id=76439
CPU #5: nip=0xc0000000000863dc thread_id=76440
CPU #6: nip=0x0000000010072f68 thread_id=76441
CPU #7: nip=0xc0000000000863dc thread_id=76442
The guest lucky04 is not responding either and did not enter xmon; sysrq is not enabled on this node.
But the node seems to be still active:
root@luckyv1:~# virsh qemu-monitor-command --hmp lucky04 'info cpus'
* CPU #0: nip=0xc000000000af8834 thread_id=68201
CPU #1: nip=0xc0000000000863dc thread_id=68202
CPU #2: nip=0xc0000000000645ac thread_id=68203
CPU #3: nip=0xc0000000000863dc thread_id=68204
CPU #4: nip=0xc0000000000863dc thread_id=68205
CPU #5: nip=0xc0000000000863dc thread_id=68206
CPU #6: nip=0xc000000000064590 thread_id=68207
CPU #7: nip=0xc000000000af8904 thread_id=68208
The guest lucky06 is alive:
root@lucky06:/# cat /proc/version; date
Linux version 4.4.0-47-generic (buildd@bos01-ppc64el-008) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #68-Ubuntu SMP Wed Oct 26 19:38:24 UTC 2016
Wed Nov 23 03:20:19 CST 2016
To summarize:
lucky01 good
lucky02 panic in locked_inode_to_wb_and_lock_list()
lucky03 not responding but still active
lucky04 not responding but still active
lucky05 panic in locked_inode_to_wb_and_lock_list()
lucky06 good
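The "not responding but still active" entries above come from the QEMU `info cpus` output. As a rough sketch, that output can be grouped by nip to see where each vCPU was sampled. The data below is the lucky03 output copied from this comment. KERNEL_BASE is an assumption: every kernel address in the traces in this report starts at 0xc000000000000000, so a lower nip is treated here as a userspace address. A single sample like this is only a snapshot and does not prove forward progress.

```python
import re
from collections import defaultdict

# 'info cpus' output for lucky03, copied from the comment above.
INFO_CPUS_LUCKY03 = """\
* CPU #0: nip=0xc0000000001035e0 thread_id=76434
CPU #1: nip=0xc0000000000863dc thread_id=76435
CPU #2: nip=0xc0000000000863dc thread_id=76436
CPU #3: nip=0xc0000000000863dc thread_id=76437
CPU #4: nip=0xc0000000000863dc thread_id=76439
CPU #5: nip=0xc0000000000863dc thread_id=76440
CPU #6: nip=0x0000000010072f68 thread_id=76441
CPU #7: nip=0xc0000000000863dc thread_id=76442
"""

# ASSUMPTION: kernel text lives at or above this address on these guests.
KERNEL_BASE = 0xC000000000000000

CPU_LINE = re.compile(r"CPU #(\d+): nip=0x([0-9a-f]+)")


def group_by_nip(text):
    """Map each sampled nip to the list of vCPU numbers found at it."""
    groups = defaultdict(list)
    for cpu, nip in CPU_LINE.findall(text):
        groups[int(nip, 16)].append(int(cpu))
    return dict(groups)


groups = group_by_nip(INFO_CPUS_LUCKY03)
user_cpus = sorted(
    cpu for nip, cpus in groups.items() if nip < KERNEL_BASE for cpu in cpus
)

for nip, cpus in sorted(groups.items()):
    where = "kernel" if nip >= KERNEL_BASE else "user"
    print(f"nip=0x{nip:016x} ({where}): CPUs {cpus}")
```

In this sample one vCPU was at a userspace address, which is consistent with the guest still doing some work even though it no longer responds.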
== Comment: #10 - Laurent Dufour - 2016-11-24 10:27:52 ==
Here is the data I captured on lucky02, which panicked the same way lucky05 did.
CPU 7 panicked due to a data access error:
7:mon> e
cpu 0x7: Vector: 300 (Data Access) at [c0000001f265b620]
pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590
sp: c0000001f265b8a0
msr: 8000000100009033
dar: 0
dsisr: 40000000
current = 0xc0000001f222fcc0
paca = 0xc00000000fb44280 softe: 0 irq_happened: 0x01
pid = 12062, comm = kworker/u16:3
7:mon> r
R00 = c00000000032831c R16 = c0000001fc972ef8
R01 = c0000001f265b8a0 R17 = c0000001fc972e70
R02 = c0000000015c6a00 R18 = c0000001fc972f60
R03 = c0000001fc972e70 R19 = 0000000000000000
R04 = c0000001f2230700 R20 = 0000000000000000
R05 = 0000000000000000 R21 = c0000001f2658000
R06 = 00000001fef30000 R22 = c0000001f35d5c88
R07 = 000108f684c40713 R23 = c0000001f35d5c68
R08 = 0000000000000000 R24 = 0000000000000000
R09 = 0000000000000000 R25 = c0000001fc972ef8
R10 = 0000000080000007 R26 = 0000000000000000
R11 = 00000000030883ec R27 = 0000000000000000
R12 = 0000000000000000 R28 = 0000000000000001
R13 = c00000000fb44280 R29 = c0000001fc972e70
R14 = c0000000000e6878 R30 = c0000001f265bba0
R15 = 0000000000000000 R31 = 0000000000000000
pc = c000000000324c60 locked_inode_to_wb_and_lock_list+0x50/0x290
cfar= 00003fff9647a5a8
lr = c00000000032831c writeback_sb_inodes+0x30c/0x590
msr = 8000000100009033 cr = 24652882
ctr = c000000000110b50 xer = 0000000020000000 trap = 300
dar = 0000000000000000 dsisr = 40000000
7:mon> t
[c0000001f265b900] c00000000032831c writeback_sb_inodes+0x30c/0x590
[c0000001f265ba10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
[c0000001f265ba70] c000000000328aec wb_writeback+0x30c/0x450
[c0000001f265bb40] c0000000003296b4 wb_workfn+0x264/0x570
[c0000001f265bc50] c0000000000dd930 process_one_work+0x1e0/0x5a0
[c0000001f265bce0] c0000000000dde84 worker_thread+0x194/0x680
[c0000001f265bd80] c0000000000e6980 kthread+0x110/0x130
[c0000001f265be30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
The system tried to access data pointed to by r31, which contains data retrieved from the inode whose address is stored in r29.
The panic happened during the inlined call to wb_get(), when inode->i_wb is used.
So here inode->i_wb is null, which is not expected to happen.
At this time, CPU 6 is waiting for the same inode's spinlock inode->i_lock to be released here:
6:mon> t
[link register ] c000000000064624 __spin_yield+0xb4/0xc0
[c0000000fdb93900] c0000000fdb93940 (unreliable)
[c0000000fdb93970] c000000000af8968 _raw_spin_lock+0xd8/0xe0
[c0000000fdb939a0] c000000000327330 __mark_inode_dirty+0xd0/0x4a0
[c0000000fdb93a20] c0000000003326f0 mark_buffer_dirty+0x1f0/0x210
[c0000000fdb93a60] c000000000334ff0 __block_commit_write.isra.7+0xf0/0x170
[c0000000fdb93ad0] c00000000033513c block_write_end+0x7c/0x100
[c0000000fdb93b20] c00000000033a340 blkdev_write_end+0x60/0xa0
[c0000000fdb93b80] c00000000022d340 generic_perform_write+0x180/0x280
[c0000000fdb93c20] c00000000022f568 __generic_file_write_iter+0x208/0x250
[c0000000fdb93c80] c00000000033b498 blkdev_write_iter+0x98/0x160
[c0000000fdb93cf0] c0000000002e24a4 new_sync_write+0xc4/0x120
[c0000000fdb93d90] c0000000002e32a0 vfs_write+0xc0/0x230
[c0000000fdb93de0] c0000000002e42dc SyS_write+0x6c/0x110
[c0000000fdb93e30] c000000000009204 system_call+0x38/0xb4
--- Exception: c01 (System Call) at 00003fff944c6728
SP (3ffef9ffe0c0) is in userspace
The CPU 6 hold the inode->i_lock in the call to inode_to_wb_and_lock_list().
Why is inode->i_wb null?
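The xmon frames above use the form address, then symbol+offset/size. As a small sketch, the start address of each function and the position of the fault inside it can be recovered by subtracting the offset from the address. The two lines below are copied from the lucky02 exception record; the helper name decode_frame is hypothetical.

```python
import re

# Lines copied from the lucky02 exception record above.
PC_LINE = "pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290"
LR_LINE = "lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590"

# A 16-digit hex address, an optional colon, then symbol+0xoffset/0xsize.
FRAME = re.compile(
    r"(?P<addr>[0-9a-f]{16}):?\s+"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def decode_frame(line):
    """Return (symbol, function_start, offset, size) for one xmon frame line."""
    match = FRAME.search(line)
    if match is None:
        raise ValueError("no symbol+offset/size frame found in line")
    addr = int(match["addr"], 16)
    offset = int(match["off"], 16)
    size = int(match["size"], 16)
    return match["sym"], addr - offset, offset, size


for line in (PC_LINE, LR_LINE):
    sym, start, offset, size = decode_frame(line)
    print(f"{sym}: starts at 0x{start:016x}, "
          f"fault/return point is {offset} of {size} bytes in")
```

The faulting pc is only 0x50 bytes into a 0x290-byte function, so the null access happens near the top of locked_inode_to_wb_and_lock_list(), before most of the function has run.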
== Comment: #11 - Laurent Dufour - 2016-11-25 11:57:50 ==
I found that lucky03 hit the panic also.
I took a closer look and it seems that there is a locking / memory barrier issue between the code run in locked_inode_to_wb_and_lock_list() and another CPU. I found that CPU 5 was running 'latest_blast' at the time CPU 0 hit the panic. The same applied on lucky02.
== Comment: #13 - Laurent Dufour - 2016-12-05 07:32:30 ==
I ran some tests on luckyv05 and was able to recreate it on a vanilla 4.8 kernel:
[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
--- Exception: 0 at 0000000000000000
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620]
pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590
sp: c0000000fb65f8a0
msr: 800000010280b033
dar: 0
dsisr: 40000000
current = 0xc0000001d69be400
paca = 0xc000000003480000 softe: 0 irq_happened: 0x01
pid = 18689, comm = kworker/u16:10
Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016
So this is not an Ubuntu-specific issue but a more general one, and it is not fixed by the patch
https://patchwork.kernel.org/patch/9247955/
as had been expected while investigating bug 142781.
== Comment: #17 - Laurent Dufour - 2016-12-07 03:22:05 ==
For the record, I also hit the bug with 4.9-rc8:
4:mon> t
[c000000012a7f900] c0000000003787cc writeback_sb_inodes+0x30c/0x590
[c000000012a7fa10] c000000000378b34 __writeback_inodes_wb+0xe4/0x150
[c000000012a7fa70] c000000000378f9c wb_writeback+0x30c/0x450
[c000000012a7fb40] c000000000379df8 wb_workfn+0x268/0x580
[c000000012a7fc50] c0000000000f8c20 process_one_work+0x1e0/0x590
[c000000012a7fce0] c0000000000f9078 worker_thread+0xa8/0x650
[c000000012a7fd80] c000000000101a30 kthread+0x110/0x130
[c000000012a7fe30] c00000000000c0e8 ret_from_kernel_thread+0x5c/0x74
4:mon> e
cpu 0x4: Vector: 300 (Data Access) at [c000000012a7f620]
pc: c000000000374f40: locked_inode_to_wb_and_lock_list+0x50/0x290
lr: c0000000003787cc: writeback_sb_inodes+0x30c/0x590
sp: c000000012a7f8a0
msr: 800000010280b033
dar: 0
dsisr: 40000000
current = 0xc000000011540000
paca = 0xc000000003482400 softe: 0 irq_happened: 0x01
pid = 8357, comm = kworker/u16:3
Linux version 4.9.0-rc8 (root@lucky05) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Tue Dec 6 05:17:47 CST 2016
== Comment: #24 - Thiago Jung Bauermann - 2017-01-11 16:09:45 ==
Dan Williams posted a patch series on 01/06 which aims to solve this bug:
https://www.spinics.net/lists/linux-fsdevel/msg106092.html
Unfortunately, the kernel test robot found problems with it:
http://lkml.iu.edu/hypermail/linux/kernel/1701.1/00239.html
Still, I think it's useful to perform tests to confirm that:
1. v4.10 is still affected by the problem and
2. Dan's patches fix this bug.
Therefore, could you please reproduce this bug on the unmodified
v4.10-rc3 build below?
http://kernel.stglabs.ibm.com/~bauermann/bug149014/v4.10-rc3/
This will allow us to confirm point 1.
Then, can you please try to reproduce it with the build below?
http://kernel.stglabs.ibm.com/~bauermann/bug149014/fix-backing_dev_info-lifetime-v2/
This one is v4.10-rc3 plus Dan Williams's two patches from my link
above applied to it.
== Comment: #28 - Lata Kuntal - 2017-01-16 01:34:05 ==
I am seeing the same crash on one of the UbuntuKVM 16.04.02 guests, gusg8.
Pasting the console logs below :
root@guskvm:~# virsh console gusg8 --force
Connected to domain gusg8
Escape character is ^]
0:mon>
0:mon>
0:mon> t
[c00000023d1ab900] c00000000036a41c writeback_sb_inodes+0x30c/0x590
[c00000023d1aba10] c00000000036a784 __writeback_inodes_wb+0xe4/0x150
[c00000023d1aba70] c00000000036abfc wb_writeback+0x30c/0x450
[c00000023d1abb40] c00000000036ba38 wb_workfn+0x268/0x580
[c00000023d1abc50] c0000000000ef5e8 process_one_work+0x1e8/0x5b0
[c00000023d1abce0] c0000000000efa58 worker_thread+0xa8/0x650
[c00000023d1abd80] c0000000000f8224 kthread+0x114/0x140
[c00000023d1abe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
--- Exception: 0 at 0000000000000000
0:mon>
0:mon>
0:mon> d
0000000000000000 **************** **************** | |
0:mon> r
R00 = c00000000036a41c R16 = c00000027ca0e868
R01 = c00000023d1ab8a0 R17 = c00000027ca0e7e0
R02 = c0000000014a6600 R18 = c00000027ca0e8d0
R03 = c00000027ca0e7e0 R19 = 0000000000000000
R04 = c0000001b092e710 R20 = 0000000000000000
R05 = 0000000000000000 R21 = c00000023d1a8000
R06 = 000000027ee30000 R22 = c000000273aace50
R07 = 00001d0c11165f1a R23 = c000000273aace30
R08 = 0000000000000000 R24 = 0000000000000000
R09 = 0000000000000000 R25 = 0000000000000000
R10 = 0000000080000000 R26 = c00000027ca0e868
R11 = c0000000014daae0 R27 = 0000000000000000
R12 = 0000000000005500 R28 = 0000000000000001
R13 = c00000000fb80000 R29 = c00000027ca0e7e0
R14 = c0000000000f8118 R30 = c00000023d1abba0
R15 = 0000000000000000 R31 = 0000000000000000
pc = c000000000366be4 locked_inode_to_wb_and_lock_list+0x54/0x290
cfar= d000000004bbf2e4 xfs_buf_delwri_submit_buffers+0x1e4/0x2b0 [xfs]
lr = c00000000036a41c writeback_sb_inodes+0x30c/0x590
msr = 800000010280b033 cr = 24aa2882
ctr = c000000000122210 xer = 0000000020000000 trap = 300
dar = 0000000000000000 dsisr = 40000000
0:mon> c
cpus stopped: 0x0-0x3
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000023d1ab620]
pc: c000000000366be4: locked_inode_to_wb_and_lock_list+0x54/0x290
lr: c00000000036a41c: writeback_sb_inodes+0x30c/0x590
sp: c00000023d1ab8a0
msr: 800000010280b033
dar: 0
dsisr: 40000000
current = 0xc0000001b092dc00
paca = 0xc00000000fb80000 softe: 0 irq_happened: 0x01
pid = 774, comm = kworker/u8:3
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
0:mon>
== Comment: #33 - Thiago Jung Bauermann - 2017-01-23 15:31:24 ==
Lekshmi mentioned that she wasn't able to reproduce this bug with kernel 4.10.0-rc3fixlifetime+, so I replied to Dan's patch series mentioning that it fixes this bug:
https://www.spinics.net/lists/linux-fsdevel/msg106830.html
Let's see if he answers back with a status or thoughts regarding the
patch series.
== Comment: #34 - LEKSHMI C. PILLAI - 2017-01-24 00:26:22 ==
Hi
The fix worked with the 4.10.0-rc3fixlifetime+ kernel. We need to know
which kernel the fix is going to be in, and whether we can get a
workaround for 16.04.02, i.e. kernel 4.8.
Thanks
Lekshmi
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1659111/+subscriptions