group.of.nepali.translators team mailing list archive

[Bug 1659111] Re: UbuntuKVM guest crashed while running I/O stress test with Ubuntu kernel 4.4.0-47-generic

To: group.of.nepali.translators@xxxxxxxxxxxxxxxxxxx
From: Launchpad Bug Tracker <1659111@xxxxxxxxxxxxxxxxxx>
Date: Tue, 16 May 2017 11:20:41 -0000
Reply-to: Bug 1659111 <1659111@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

This bug was fixed in the package linux - 4.4.0-78.99

---------------
linux (4.4.0-78.99) xenial; urgency=low

  * linux: 4.4.0-78.99 -proposed tracker (LP: #1686645)

  * Please backport fix to reference leak in cgroup blkio throttle
    (LP: #1683976)
    - block: fix module reference leak on put_disk() call for cgroups throttle

  * UbuntuKVM guest crashed while running I/O stress test with Ubuntu kernel
    4.4.0-47-generic (LP: #1659111)
    - block: Unhash block device inodes on gendisk destruction
    - block: Use pointer to backing_dev_info from request_queue
    - block: Dynamically allocate and refcount backing_dev_info
    - block: Make blk_get_backing_dev_info() safe without open bdev
    - block: Get rid of blk_get_backing_dev_info()
    - block: Move bdev_unhash_inode() after invalidate_partition()
    - block: Unhash also block device inode for the whole device
    - block: Revalidate i_bdev reference in bd_aquire()
    - block: Initialize bd_bdi on inode initialization
    - block: Move bdi_unregister() to del_gendisk()
    - block: Allow bdi re-registration
    - bdi: Fix use-after-free in wb_congested_put()
    - block: Make del_gendisk() safer for disks without queues
    - block: Fix bdi assignment to bdev inode when racing with disk delete
    - bdi: Mark congested->bdi as internal
    - bdi: Make wb->bdi a proper reference
    - bdi: Unify bdi->wb_list handling for root wb_writeback
    - bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
    - bdi: Do not wait for cgwbs release in bdi_unregister()
    - bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
    - block: Fix oops in locked_inode_to_wb_and_lock_list()
    - kobject: Export kobject_get_unless_zero()
    - block: Fix oops scsi_disk_get()

  * Touchpad not working correctly after kernel upgrade (LP: #1662589)
    - Input: ALPS - fix V8+ protocol handling (73 03 28)

  * Xenial update to v4.4.62 stable release (LP: #1683728)
    - drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
    - drm/i915: Stop using RP_DOWN_EI on Baytrail
    - usb: dwc3: gadget: delay unmap of bounced requests
    - mtd: bcm47xxpart: fix parsing first block after aligned TRX
    - MIPS: Introduce irq_stack
    - MIPS: Stack unwinding while on IRQ stack
    - MIPS: Only change $28 to thread_info if coming from user mode
    - MIPS: Switch to the irq_stack in interrupts
    - MIPS: Select HAVE_IRQ_EXIT_ON_IRQ_STACK
    - MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch
    - crypto: caam - fix RNG deinstantiation error checking
    - Linux 4.4.62

  * ifup service of network device stay active after driver stop (LP: #1672144)
    - net: use net->count to check whether a netns is alive or not

  * [Hyper-V] mkfs regression in kernel 4.4+ (LP: #1682215)
    - block: relax check on sg gap

  * [Feature] KBL: intel_powerclamp driver support (LP: #1591641)
    - thermal/powerclamp: remove cpu whitelist
    - thermal/powerclamp: correct cpu support check
    - thermal/powerclamp: add back module device table

  * sysfs channel reads of lps22hb pressure sensor are stale (LP: #1682103)
    - iio: st_pressure: initialize lps22hb bootime

  * Backlight control does not work and there are no entries in
    /sys/class/backlight (LP: #1667323)
    - Revert "ACPI / video: Add force_native quirk for HP Pavilion dv6"

  * [Feature] KBL: intel_rapl driver support (LP: #1591640)
    - powercap/intel_rapl: Add support for Kabylake

  * Xenial update to v4.4.61 stable release (LP: #1682140)
    - drm/vmwgfx: Type-check lookups of fence objects
    - drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
    - drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
    - drm/ttm, drm/vmwgfx: Relax permission checking when opening surfaces
    - drm/vmwgfx: Remove getparam error message
    - drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
    - sysfs: be careful of error returns from ops->show()
    - staging: android: ashmem: lseek failed due to no FMODE_LSEEK.
    - arm/arm64: KVM: Take mmap_sem in stage2_unmap_vm
    - arm/arm64: KVM: Take mmap_sem in kvm_arch_prepare_memory_region
    - iio: bmg160: reset chip when probing
    - Reset TreeId to zero on SMB2 TREE_CONNECT
    - ptrace: fix PTRACE_LISTEN race corrupting task->state
    - ring-buffer: Fix return value check in test_ringbuffer()
    - metag/usercopy: Drop unused macros
    - metag/usercopy: Fix alignment error checking
    - metag/usercopy: Add early abort to copy_to_user
    - metag/usercopy: Zero rest of buffer from copy_from_user
    - metag/usercopy: Set flags before ADDZ
    - metag/usercopy: Fix src fixup in from user rapf loops
    - metag/usercopy: Add missing fixups
    - powerpc/mm: Add missing global TLB invalidate if cxl is active
    - powerpc: Don't try to fix up misaligned load-with-reservation instructions
    - nios2: reserve boot memory for device tree
    - s390/decompressor: fix initrd corruption caused by bss clear
    - s390/uaccess: get_user() should zero on failure (again)
    - MIPS: Force o32 fp64 support on 32bit MIPS64r6 kernels
    - MIPS: ralink: Fix typos in rt3883 pinctrl
    - MIPS: End spinlocks with .insn
    - MIPS: Lantiq: fix missing xbar kernel panic
    - MIPS: Flush wrong invalid FTLB entry for huge page
    - mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
    - Linux 4.4.61

  * Xenial update to v4.4.60 stable release (LP: #1681862)
    - libceph: force GFP_NOIO for socket allocations
    - xen/setup: Don't relocate p2m over existing one
    - scsi: mpt3sas: fix hang on ata passthrough commands
    - scsi: sg: check length passed to SG_NEXT_CMD_LEN
    - scsi: libsas: fix ata xfer length
    - ALSA: seq: Fix race during FIFO resize
    - ALSA: hda - fix a problem for lineout on a Dell AIO machine
    - ASoC: atmel-classd: fix audio clock rate
    - ACPI: Fix incompatibility with mcount-based function graph tracing
    - ACPI: Do not create a platform_device for IOAPIC/IOxAPIC
    - tty/serial: atmel: fix race condition (TX+DMA)
    - tty/serial: atmel: fix TX path in atmel_console_write()
    - USB: fix linked-list corruption in rh_call_control()
    - KVM: x86: clear bus pointer when destroyed
    - drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
    - mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
    - MIPS: Lantiq: Fix cascaded IRQ setup
    - rtc: s35390a: fix reading out alarm
    - rtc: s35390a: make sure all members in the output are set
    - rtc: s35390a: implement reset routine as suggested by the reference
    - rtc: s35390a: improve irq handling
    - KVM: kvm_io_bus_unregister_dev() should never fail
    - power: reset: at91-poweroff: timely shutdown LPDDR memories
    - blk: improve order of bio handling in generic_make_request()
    - blk: Ensure users for current->bio_list can see the full list.
    - padata: avoid race in reordering
    - Linux 4.4.60

 -- Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx>  Thu, 27 Apr 2017 10:24:08 -0300

** Changed in: linux (Ubuntu Xenial)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1659111

Title:
  UbuntuKVM guest crashed while running I/O stress test with Ubuntu
  kernel 4.4.0-47-generic

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Yakkety:
  In Progress
Status in linux source package in Zesty:
  In Progress

Bug description:
  Attn. Canonical: For your awareness only at this time.

  == Comment: #0 - LEKSHMI C. PILLAI  - 2016-11-22 03:49:38 ==

  Machine INFO

  KVM HOST: luckyv1

  Guest: lucky05

  lucky05 crashed while running the I/O stress test for SAN disks.

  Installed lucky05 and enabled xmon on it. After that, started the
  raw disk test on around 50 disks. After 6-7 hours of running, the
  machine dropped into xmon.

  Logs:
  [25023.224182] Unable to handle kernel paging request for data at address 0x00000000
  [25023.224257] Faulting instruction address: 0xc000000000324c60
  cpu 0x3: Vector: 300 (Data Access) at [c0000000fffc3620]
      pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290
      lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590
      sp: c0000000fffc38a0
     msr: 8000000100009033
     dar: 0
   dsisr: 40000000
    current = 0xc0000000ff99e470
    paca    = 0xc00000000fb41c80   softe: 0        irq_happened: 0x01
      pid   = 14736, comm = kworker/u16:8
  enter ? for help
  [c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
  [c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
  [c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
  [c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
  [c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
  [c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
  [c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
  [c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
  3:mon> f
  3:mon> th
  [c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
  [c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
  [c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
  [c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
  [c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
  [c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
  [c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
  [c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4         
  3:mon> sh
  [27384.651055] INFO: rcu_sched detected stalls on CPUs/tasks:
  [27384.651220]  (detected by 4, t=40598 jiffies, g=2849830, c=2849829, q=992)
  [27384.651286] All QSes seen, last rcu_sched kthread activity 40596 (4301188714-4301148118), jiffies_till_next_fqs=1, root ->qsmask 0x0
  [27384.651501] rcu_sched kthread starved for 40596 jiffies! g2849830 c2849829 f0x2 s3 ->state=0x0
  [27384.651747] INFO: rcu_sched detected stalls on CPUs/tasks:
  [27384.651905]  (detected by 4, t=590354 jiffies, g=2849830, c=2849829, q=1285)
  [27384.652012] All QSes seen, last rcu_sched kthread activity 590352 (4301738470-4301148118), jiffies_till_next_fqs=1, root ->qsmask 0x0
  [27384.652191] rcu_sched kthread starved for 590352 jiffies! g2849830 c2849829 f0x2 s3 ->state=0x0
  [27384.730645] Unable to handle kernel paging request for data at address 0xffffffffffffffd8
  [27384.730781] Faulting instruction address: 0xc0000000000e7258
  cpu 0x3: Vector: 300 (Data Access) at [c0000000fffc3000]
      pc: c0000000000e7258: kthread_data+0x28/0x40
      lr: c0000000000de940: wq_worker_sleeping+0x30/0x110
      sp: c0000000fffc3280
     msr: 8000000100009033
     dar: ffffffffffffffd8
   dsisr: 40000000
    current = 0xc0000000ff99e470
    paca    = 0xc00000000fb41c80   softe: 0        irq_happened: 0x01
      pid   = 14736, comm = kworker/u16:8
  enter ? for help                

  == Comment: #1 - LEKSHMI C. PILLAI - 2016-11-22 04:05:41 ==
  3:mon> th
  [c0000000fffc32b0] c0000000000de940 wq_worker_sleeping+0x30/0x110
  [c0000000fffc32f0] c000000000af31bc __schedule+0x6ec/0x990
  [c0000000fffc33c0] c000000000af34a8 schedule+0x48/0xc0
  [c0000000fffc33f0] c0000000000bd3d0 do_exit+0x760/0xc30
  [c0000000fffc34b0] c000000000020bf4 die+0x314/0x470
  [c0000000fffc3540] c000000000050d98 bad_page_fault+0xd8/0x150
  [c0000000fffc35b0] c000000000008680 handle_page_fault+0x2c/0x30
  --- Exception: 300 (Data Access) at c000000000324c60 locked_inode_to_wb_and_lock_list+0x50/0x290
  [c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
  [c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
  [c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
  [c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
  [c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
  [c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
  [c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
  [c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
  3:mon>

  == Comment: #6 - Laurent Dufour - 2016-11-23 03:00:16 ==
  Logged in to luckyv1 and found a lot of ipr issues on this node:
  [525973.896624] qla2xxx 0005:09:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
  [525973.956619] qla2xxx 0005:09:00.1: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
  [529433.834853] ipr 0001:04:00.0: FFFE: Soft device bus error recovered by the IOA
  [529433.834867] ipr: -----Failing Device Information-----
  [529433.834870] ipr: World Wide Unique ID: 500507605EC10C000000000000000000
  [529433.834873] ipr: Device Resource Path: FF
  [529433.834875] ipr: Primary Problem Description: Command Timeout                
  [529433.834878] ipr: Secondary Problem Description:  Command timeout expired        
  [529433.834880] ipr: SCSI Sense Data:
  [529433.834882] ipr: 00000000: 00000000 00000000 00000000 00000000
  [529433.834884] ipr: 00000010: 00000000 00000000 00000000 00000000
  [529433.834886] ipr: SCSI Command Descriptor Block: 
  [529433.834889] ipr: 00000000: 9E120004 0F000000 00000000 0020AD00
  [529433.834891] ipr: Additional IOA Data:
  [529433.834893] ipr: 00000000: 4646001C 44010007 00050000 04700002
  [529433.834895] ipr: 00000010: 3B894A49 1EE620CC 04700002 49574631
  [529433.834897] ipr: 00000020: 455300CC 06B00027 00000020 84000000
  [529433.834899] ipr: 00000030: 00000000 05801000 0B29A7C0 00000000
  [529433.834901] ipr: 00000040: 00000000 00000000 00000000 00000000
  [529433.834904] ipr: 00000050: 00000000 00000000 00000000 00000000
  [529433.834906] ipr: 00000060: 00000000 00000000 00000000 00000000
  [529433.834908] ipr: 00000070: 00000000 00000000 00000000 00000000
  [529433.834910] ipr: 00000080: 00000000 00000000 00000000 00000000
  [529433.834912] ipr: 00000090: 00000000 00000000 00000000 00000000
  [529433.834914] ipr: 000000A0: 00000000 D4000018 80000000 FFFFFFFF
  [529433.834917] ipr: 000000B0: FFFFFFFF 00000000 0980EC21 00000000
  [529433.834919] ipr: 000000C0: 00000000 00000000 01769A24 00000000
  [529433.834921] ipr: 000000D0: 01D3C300 E0050000 FFFFFFFE 0B5A0000
  [529433.834923] ipr: 000000E0: 00000000 9E120004 0F000000 00000000
  [529433.834926] ipr: 000000F0: 43440010 9E120004 0F000000 00000000
  [529433.834928] ipr: 00000100: 0020AD00 45480010 0100E038 9E12FFFF
  [529433.834930] ipr: 00000110: 01080002 00000000 45540004 00001463

  In addition, some NFS issues are reported:
  [563034.817901] nfs: server 10.33.11.31 not responding, timed out
  [563405.504308] nfs: server 10.33.11.31 not responding, timed out

  That said, chig5 entered xmon due to a bad pointer in the kernel:
  3:mon> e
  cpu 0x3: Vector: 300 (Data Access) at [c0000000fffc3000]
      pc: c0000000000e7258: kthread_data+0x28/0x40
      lr: c0000000000de940: wq_worker_sleeping+0x30/0x110
      sp: c0000000fffc3280
     msr: 8000000100009033
     dar: ffffffffffffffd8
   dsisr: 40000000
    current = 0xc0000000ff99e470
    paca    = 0xc00000000fb41c80	 softe: 0	 irq_happened: 0x01
      pid   = 14736, comm = kworker/u16:8
  3:mon> th
  [c0000000fffc32b0] c0000000000de940 wq_worker_sleeping+0x30/0x110
  [c0000000fffc32f0] c000000000af31bc __schedule+0x6ec/0x990
  [c0000000fffc33c0] c000000000af34a8 schedule+0x48/0xc0
  [c0000000fffc33f0] c0000000000bd3d0 do_exit+0x760/0xc30
  [c0000000fffc34b0] c000000000020bf4 die+0x314/0x470
  [c0000000fffc3540] c000000000050d98 bad_page_fault+0xd8/0x150
  [c0000000fffc35b0] c000000000008680 handle_page_fault+0x2c/0x30
  --- Exception: 300 (Data Access) at c000000000324c60 locked_inode_to_wb_and_lock_list+0x50/0x290
  [c0000000fffc3900] c00000000032831c writeback_sb_inodes+0x30c/0x590
  [c0000000fffc3a10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
  [c0000000fffc3a70] c000000000328aec wb_writeback+0x30c/0x450
  [c0000000fffc3b40] c0000000003296b4 wb_workfn+0x264/0x570
  [c0000000fffc3c50] c0000000000dd930 process_one_work+0x1e0/0x5a0
  [c0000000fffc3ce0] c0000000000dde84 worker_thread+0x194/0x680
  [c0000000fffc3d80] c0000000000e6980 kthread+0x110/0x130
  [c0000000fffc3e30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4

  Looking at the other guests, as Lekshmi mentioned that all the guests
  are crashing.

  == Comment: #7 - Laurent Dufour - 2016-11-23 03:24:34 ==
  The guest lucky01 (4.4.0-47-generic) is fine:
  root@lucky01:/Blast# date
  Wed Nov 23 03:04:23 CST 2016

  The guest lucky02 (4.4.0-47-generic) has entered xmon due to the same issue as lucky05:
  7:mon> e
  cpu 0x7: Vector: 300 (Data Access) at [c0000001f265b620]
      pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290
      lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590
      sp: c0000001f265b8a0
     msr: 8000000100009033
     dar: 0
   dsisr: 40000000
    current = 0xc0000001f222fcc0
    paca    = 0xc00000000fb44280	 softe: 0	 irq_happened: 0x01
      pid   = 12062, comm = kworker/u16:3
  7:mon> t
  [c0000001f265b900] c00000000032831c writeback_sb_inodes+0x30c/0x590
  [c0000001f265ba10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
  [c0000001f265ba70] c000000000328aec wb_writeback+0x30c/0x450
  [c0000001f265bb40] c0000000003296b4 wb_workfn+0x264/0x570
  [c0000001f265bc50] c0000000000dd930 process_one_work+0x1e0/0x5a0
  [c0000001f265bce0] c0000000000dde84 worker_thread+0x194/0x680
  [c0000001f265bd80] c0000000000e6980 kthread+0x110/0x130
  [c0000001f265be30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
  --- Exception: 0  at 0000000000000000

  The guest lucky03 didn't enter xmon but is not responding any more. Unfortunately, sysrq is not enabled on this guest. There is still some activity on this guest.
  root@luckyv1:~# virsh qemu-monitor-command --hmp lucky03 'info cpus'
  * CPU #0: nip=0xc0000000001035e0 thread_id=76434
    CPU #1: nip=0xc0000000000863dc thread_id=76435
    CPU #2: nip=0xc0000000000863dc thread_id=76436
    CPU #3: nip=0xc0000000000863dc thread_id=76437
    CPU #4: nip=0xc0000000000863dc thread_id=76439
    CPU #5: nip=0xc0000000000863dc thread_id=76440
    CPU #6: nip=0x0000000010072f68 thread_id=76441
    CPU #7: nip=0xc0000000000863dc thread_id=76442

  
  The guest lucky04 is not responding either and did not enter xmon; sysrq is not enabled on this node.
  The node still seems to be active:
  root@luckyv1:~# virsh qemu-monitor-command --hmp lucky04 'info cpus'
  * CPU #0: nip=0xc000000000af8834 thread_id=68201
    CPU #1: nip=0xc0000000000863dc thread_id=68202
    CPU #2: nip=0xc0000000000645ac thread_id=68203
    CPU #3: nip=0xc0000000000863dc thread_id=68204
    CPU #4: nip=0xc0000000000863dc thread_id=68205
    CPU #5: nip=0xc0000000000863dc thread_id=68206
    CPU #6: nip=0xc000000000064590 thread_id=68207
    CPU #7: nip=0xc000000000af8904 thread_id=68208
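  As a rough illustration of how the 'info cpus' output can be read, here
  is a minimal Python sketch that parses the lucky03 sample quoted above
  and reports how many distinct instruction pointers are seen and which
  CPUs are in user space. The helper names and the KERNEL_BASE threshold
  are assumptions made for this example (ppc64 kernel addresses in these
  logs start at 0xc000...); it is not part of the original analysis.

```python
import re

# One output line looks like: "* CPU #0: nip=0xc0000000001035e0 thread_id=76434"
INFO_CPUS_LINE = re.compile(r"CPU #(\d+): nip=(0x[0-9a-f]+) thread_id=(\d+)")

# Assumption: on ppc64, kernel text lives at or above this address.
KERNEL_BASE = 0xC000000000000000

# 'info cpus' output for lucky03, copied from the comment above.
SAMPLE = """\
* CPU #0: nip=0xc0000000001035e0 thread_id=76434
  CPU #1: nip=0xc0000000000863dc thread_id=76435
  CPU #2: nip=0xc0000000000863dc thread_id=76436
  CPU #3: nip=0xc0000000000863dc thread_id=76437
  CPU #4: nip=0xc0000000000863dc thread_id=76439
  CPU #5: nip=0xc0000000000863dc thread_id=76440
  CPU #6: nip=0x0000000010072f68 thread_id=76441
  CPU #7: nip=0xc0000000000863dc thread_id=76442
"""


def parse_info_cpus(text):
    """Return {cpu_number: nip} for every CPU line found in the text."""
    cpus = {}
    for line in text.splitlines():
        match = INFO_CPUS_LINE.search(line)
        if match:
            cpus[int(match.group(1))] = int(match.group(2), 16)
    return cpus


def summarize(cpus):
    """Count CPUs and distinct nips, and list CPUs executing user-space code."""
    user_mode = sorted(cpu for cpu, nip in cpus.items() if nip < KERNEL_BASE)
    return {
        "cpus": len(cpus),
        "distinct_nips": len(set(cpus.values())),
        "user_mode_cpus": user_mode,
    }


summary = summarize(parse_info_cpus(SAMPLE))
print(summary)
```

  A single sample only shows where each CPU is at one instant. Taking two
  samples some seconds apart and comparing the nip values would give a
  stronger hint that the guest is still making progress.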

  The guest lucky06 is alive:
  root@lucky06:/# cat /proc/version; date
  Linux version 4.4.0-47-generic (buildd@bos01-ppc64el-008) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #68-Ubuntu SMP Wed Oct 26 19:38:24 UTC 2016
  Wed Nov 23 03:20:19 CST 2016

  To summarize:
  lucky01  good
  lucky02  panic in locked_inode_to_wb_and_lock_list()
  lucky03  not responding but still active
  lucky04  not responding but still active
  lucky05  panic in locked_inode_to_wb_and_lock_list()
  lucky06  good

  == Comment: #10 - Laurent Dufour - 2016-11-24 10:27:52 ==
  Here is the data I captured on lucky02, which panicked the same way lucky05 did.

  CPU 7 panicked due to a data access error:
   7:mon> e
  cpu 0x7: Vector: 300 (Data Access) at [c0000001f265b620]
      pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290
      lr: c00000000032831c: writeback_sb_inodes+0x30c/0x590
      sp: c0000001f265b8a0
     msr: 8000000100009033
     dar: 0
   dsisr: 40000000
    current = 0xc0000001f222fcc0
    paca    = 0xc00000000fb44280	 softe: 0	 irq_happened: 0x01
      pid   = 12062, comm = kworker/u16:3
  7:mon> r
  R00 = c00000000032831c   R16 = c0000001fc972ef8
  R01 = c0000001f265b8a0   R17 = c0000001fc972e70
  R02 = c0000000015c6a00   R18 = c0000001fc972f60
  R03 = c0000001fc972e70   R19 = 0000000000000000
  R04 = c0000001f2230700   R20 = 0000000000000000
  R05 = 0000000000000000   R21 = c0000001f2658000
  R06 = 00000001fef30000   R22 = c0000001f35d5c88
  R07 = 000108f684c40713   R23 = c0000001f35d5c68
  R08 = 0000000000000000   R24 = 0000000000000000
  R09 = 0000000000000000   R25 = c0000001fc972ef8
  R10 = 0000000080000007   R26 = 0000000000000000
  R11 = 00000000030883ec   R27 = 0000000000000000
  R12 = 0000000000000000   R28 = 0000000000000001
  R13 = c00000000fb44280   R29 = c0000001fc972e70
  R14 = c0000000000e6878   R30 = c0000001f265bba0
  R15 = 0000000000000000   R31 = 0000000000000000 
  pc  = c000000000324c60 locked_inode_to_wb_and_lock_list+0x50/0x290
  cfar= 00003fff9647a5a8
  lr  = c00000000032831c writeback_sb_inodes+0x30c/0x590
  msr = 8000000100009033   cr  = 24652882
  ctr = c000000000110b50   xer = 0000000020000000   trap =  300
  dar = 0000000000000000   dsisr = 40000000
  7:mon> t 
  [c0000001f265b900] c00000000032831c writeback_sb_inodes+0x30c/0x590
  [c0000001f265ba10] c000000000328684 __writeback_inodes_wb+0xe4/0x150
  [c0000001f265ba70] c000000000328aec wb_writeback+0x30c/0x450
  [c0000001f265bb40] c0000000003296b4 wb_workfn+0x264/0x570
  [c0000001f265bc50] c0000000000dd930 process_one_work+0x1e0/0x5a0
  [c0000001f265bce0] c0000000000dde84 worker_thread+0x194/0x680
  [c0000001f265bd80] c0000000000e6980 kthread+0x110/0x130
  [c0000001f265be30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4

  The system tried to access data pointed to by r31, which contains data retrieved from the inode whose address is stored in r29.
  The panic happened during the inline call to wb_get(), when inode->i_wb is used.
  So here inode->i_wb is NULL, which is not expected to happen.
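  The register reading above can be cross-checked with a small Python
  sketch that parses a few lines of the xmon 'r' output for lucky02. The
  function name is hypothetical, and the remark about r3 is an inference
  from the ppc64 ELF ABI (r3 carries the first argument), not something
  stated in the dump itself.

```python
import re

# A few lines copied from the lucky02 xmon register dump quoted above.
DUMP = """\
R03 = c0000001fc972e70   R19 = 0000000000000000
R13 = c00000000fb44280   R29 = c0000001fc972e70
R15 = 0000000000000000   R31 = 0000000000000000
dar = 0000000000000000   dsisr = 40000000
"""


def parse_xmon_regs(text):
    """Return ({gpr_number: value}, dar) parsed from xmon 'r' output."""
    gprs = {
        int(number): int(value, 16)
        for number, value in re.findall(r"\bR(\d{2}) = ([0-9a-f]{16})", text)
    }
    dar_match = re.search(r"\bdar = ([0-9a-f]{16})", text)
    dar = int(dar_match.group(1), 16) if dar_match else None
    return gprs, dar


gprs, dar = parse_xmon_regs(DUMP)

# r29 holds the inode address; r3 still carries the same value.
print(f"r29 = {gprs[29]:#x}, r3 = {gprs[3]:#x}")
# r31 is the pointer loaded from the inode (inode->i_wb in the analysis above).
print(f"r31 = {gprs[31]:#x}, dar = {dar:#x}")
if gprs[31] == 0 and dar == 0:
    print("faulting access went through a NULL pointer")
```

  Both r31 and the data address register are zero, which matches the
  "data at address 0x00000000" message in the kernel log.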

  At this time, CPU 6 is waiting for the same inode's spinlock inode->i_lock to be released here:
  6:mon> t
  [link register   ] c000000000064624 __spin_yield+0xb4/0xc0
  [c0000000fdb93900] c0000000fdb93940 (unreliable)
  [c0000000fdb93970] c000000000af8968 _raw_spin_lock+0xd8/0xe0
  [c0000000fdb939a0] c000000000327330 __mark_inode_dirty+0xd0/0x4a0
  [c0000000fdb93a20] c0000000003326f0 mark_buffer_dirty+0x1f0/0x210
  [c0000000fdb93a60] c000000000334ff0 __block_commit_write.isra.7+0xf0/0x170
  [c0000000fdb93ad0] c00000000033513c block_write_end+0x7c/0x100
  [c0000000fdb93b20] c00000000033a340 blkdev_write_end+0x60/0xa0
  [c0000000fdb93b80] c00000000022d340 generic_perform_write+0x180/0x280
  [c0000000fdb93c20] c00000000022f568 __generic_file_write_iter+0x208/0x250
  [c0000000fdb93c80] c00000000033b498 blkdev_write_iter+0x98/0x160
  [c0000000fdb93cf0] c0000000002e24a4 new_sync_write+0xc4/0x120
  [c0000000fdb93d90] c0000000002e32a0 vfs_write+0xc0/0x230
  [c0000000fdb93de0] c0000000002e42dc SyS_write+0x6c/0x110
  [c0000000fdb93e30] c000000000009204 system_call+0x38/0xb4
  --- Exception: c01 (System Call) at 00003fff944c6728
  SP (3ffef9ffe0c0) is in userspace

  CPU 6 holds inode->i_lock in the call to inode_to_wb_and_lock_list().
  Why is inode->i_wb NULL?

  == Comment: #11 - Laurent Dufour - 2016-11-25 11:57:50 ==
  I found that lucky03 hit the panic also.
  I took a closer look and it seems that there is a locking / memory barrier issue between the code run in locked_inode_to_wb_and_lock_list() and another CPU. I found that CPU 5 was running 'latest_blast' at the time CPU 0 hit the panic. The same applied on lucky02.

  == Comment: #13 - Laurent Dufour - 2016-12-05 07:32:30 ==
  I did some tests on luckyv05 and was able to recreate it on a vanilla 4.8 kernel:
  [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
  [113031.075614] Faulting instruction address: 0xc0000000003692e0
  0:mon> t
  [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
  [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
  [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
  [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
  [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
  [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
  [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
  [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
  --- Exception: 0  at 0000000000000000
  0:mon> e
  cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620]
      pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
      lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590
      sp: c0000000fb65f8a0
     msr: 800000010280b033
     dar: 0
   dsisr: 40000000
    current = 0xc0000001d69be400
    paca    = 0xc000000003480000	 softe: 0	 irq_happened: 0x01
      pid   = 18689, comm = kworker/u16:10
  Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016

  So this is not an Ubuntu-specific issue but a more general one, which is not fixed by the patch
  https://patchwork.kernel.org/patch/9247955/
  as expected while investigating bug 142781.

  == Comment: #17 - Laurent Dufour - 2016-12-07 03:22:05 ==
  For the record, I also hit the bug with 4.9-rc8:
  4:mon> t
  [c000000012a7f900] c0000000003787cc writeback_sb_inodes+0x30c/0x590
  [c000000012a7fa10] c000000000378b34 __writeback_inodes_wb+0xe4/0x150
  [c000000012a7fa70] c000000000378f9c wb_writeback+0x30c/0x450
  [c000000012a7fb40] c000000000379df8 wb_workfn+0x268/0x580
  [c000000012a7fc50] c0000000000f8c20 process_one_work+0x1e0/0x590
  [c000000012a7fce0] c0000000000f9078 worker_thread+0xa8/0x650
  [c000000012a7fd80] c000000000101a30 kthread+0x110/0x130
  [c000000012a7fe30] c00000000000c0e8 ret_from_kernel_thread+0x5c/0x74
  4:mon> e
  cpu 0x4: Vector: 300 (Data Access) at [c000000012a7f620]
      pc: c000000000374f40: locked_inode_to_wb_and_lock_list+0x50/0x290
      lr: c0000000003787cc: writeback_sb_inodes+0x30c/0x590
      sp: c000000012a7f8a0
     msr: 800000010280b033
     dar: 0
   dsisr: 40000000
    current = 0xc000000011540000
    paca    = 0xc000000003482400	 softe: 0	 irq_happened: 0x01
      pid   = 8357, comm = kworker/u16:3
  Linux version 4.9.0-rc8 (root@lucky05) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Tue Dec 6 05:17:47 CST 2016
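  To make the comparison across kernels explicit, the following Python
  sketch parses the "pc:" lines from the 4.4.0-47-generic, 4.8.0 and
  4.9.0-rc8 dumps quoted above and checks that they fault at the same
  symbol and offset. The function and variable names are made up for this
  example.

```python
import re

# "pc:" lines copied from the xmon 'e' output of each kernel quoted above.
PC_LINES = {
    "4.4.0-47-generic": "pc: c000000000324c60: locked_inode_to_wb_and_lock_list+0x50/0x290",
    "4.8.0": "pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290",
    "4.9.0-rc8": "pc: c000000000374f40: locked_inode_to_wb_and_lock_list+0x50/0x290",
}

PC_RE = re.compile(r"pc: ([0-9a-f]+): (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")


def parse_pc(line):
    """Return (symbol, offset, size, function_start) from an xmon pc line."""
    match = PC_RE.search(line)
    if match is None:
        raise ValueError(f"not an xmon pc line: {line!r}")
    addr, symbol, offset, size = match.groups()
    offset = int(offset, 16)
    # The function start is the faulting address minus the offset into it.
    return symbol, offset, int(size, 16), int(addr, 16) - offset


sites = {kernel: parse_pc(line) for kernel, line in PC_LINES.items()}
fault_sites = {(symbol, offset) for symbol, offset, _, _ in sites.values()}

for kernel, (symbol, offset, size, start) in sites.items():
    print(f"{kernel}: {symbol}+{offset:#x}/{size:#x} (function at {start:#x})")
print("same fault site in all kernels:", len(fault_sites) == 1)
```

  The absolute addresses differ between builds, but the symbol and the
  offset into it are identical, which is consistent with the same code
  path failing on all three kernels.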

  == Comment: #24 - Thiago Jung Bauermann - 2017-01-11 16:09:45 ==
  Dan Williams posted a patch series on 01/06 which aims to solve this bug:

  https://www.spinics.net/lists/linux-fsdevel/msg106092.html

  Unfortunately, the kernel test robot found problems with it:

  http://lkml.iu.edu/hypermail/linux/kernel/1701.1/00239.html

  Still, I think it's useful to perform tests to confirm that:

  1. v4.10 is still affected by the problem and
  2. Dan's patches fix this bug.

  Therefore, could you please reproduce this bug on the unmodified
  v4.10-rc3 build below?

  http://kernel.stglabs.ibm.com/~bauermann/bug149014/v4.10-rc3/

  This will allow us to confirm point 1.

  Then, can you please try to reproduce it with the build below?

  http://kernel.stglabs.ibm.com/~bauermann/bug149014/fix-backing_dev_info-lifetime-v2/

  This one is v4.10-rc3 plus Dan Williams' two patches from my link
  above applied to it.

  == Comment: #28 - Lata Kuntal - 2017-01-16 01:34:05 ==
  I am seeing the same crash on one of the UbuntuKVM 16.04.02 guests, gusg8.
  Pasting the console logs below:

  root@guskvm:~# virsh console gusg8 --force
  Connected to domain gusg8
  Escape character is ^]

  0:mon>
  0:mon>
  0:mon> t
  [c00000023d1ab900] c00000000036a41c writeback_sb_inodes+0x30c/0x590
  [c00000023d1aba10] c00000000036a784 __writeback_inodes_wb+0xe4/0x150
  [c00000023d1aba70] c00000000036abfc wb_writeback+0x30c/0x450
  [c00000023d1abb40] c00000000036ba38 wb_workfn+0x268/0x580
  [c00000023d1abc50] c0000000000ef5e8 process_one_work+0x1e8/0x5b0
  [c00000023d1abce0] c0000000000efa58 worker_thread+0xa8/0x650
  [c00000023d1abd80] c0000000000f8224 kthread+0x114/0x140
  [c00000023d1abe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
  --- Exception: 0  at 0000000000000000
  0:mon>
  0:mon>
  0:mon> d
  0000000000000000 **************** ****************  |                |
  0:mon> r
  R00 = c00000000036a41c   R16 = c00000027ca0e868
  R01 = c00000023d1ab8a0   R17 = c00000027ca0e7e0
  R02 = c0000000014a6600   R18 = c00000027ca0e8d0
  R03 = c00000027ca0e7e0   R19 = 0000000000000000
  R04 = c0000001b092e710   R20 = 0000000000000000
  R05 = 0000000000000000   R21 = c00000023d1a8000
  R06 = 000000027ee30000   R22 = c000000273aace50
  R07 = 00001d0c11165f1a   R23 = c000000273aace30
  R08 = 0000000000000000   R24 = 0000000000000000
  R09 = 0000000000000000   R25 = 0000000000000000
  R10 = 0000000080000000   R26 = c00000027ca0e868
  R11 = c0000000014daae0   R27 = 0000000000000000
  R12 = 0000000000005500   R28 = 0000000000000001
  R13 = c00000000fb80000   R29 = c00000027ca0e7e0
  R14 = c0000000000f8118   R30 = c00000023d1abba0
  R15 = 0000000000000000   R31 = 0000000000000000
  pc  = c000000000366be4 locked_inode_to_wb_and_lock_list+0x54/0x290
  cfar= d000000004bbf2e4 xfs_buf_delwri_submit_buffers+0x1e4/0x2b0 [xfs]
  lr  = c00000000036a41c writeback_sb_inodes+0x30c/0x590
  msr = 800000010280b033   cr  = 24aa2882
  ctr = c000000000122210   xer = 0000000020000000   trap =  300
  dar = 0000000000000000   dsisr = 40000000
  0:mon> c
  cpus stopped: 0x0-0x3
  0:mon> e
  cpu 0x0: Vector: 300 (Data Access) at [c00000023d1ab620]
      pc: c000000000366be4: locked_inode_to_wb_and_lock_list+0x54/0x290
      lr: c00000000036a41c: writeback_sb_inodes+0x30c/0x590
      sp: c00000023d1ab8a0
     msr: 800000010280b033
     dar: 0
   dsisr: 40000000
    current = 0xc0000001b092dc00
    paca    = 0xc00000000fb80000   softe: 0        irq_happened: 0x01
      pid   = 774, comm = kworker/u8:3
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  0:mon>

  
  == Comment: #33 - Thiago Jung Bauermann - 2017-01-23 15:31:24 ==
  Lekshmi mentioned that she wasn't able to reproduce this bug with kernel 4.10.0-rc3fixlifetime+, so I replied to Dan's patch series mentioning that it fixes this bug:

  https://www.spinics.net/lists/linux-fsdevel/msg106830.html

  Let's see if he answers back with a status or thoughts regarding the
  patch series.

  == Comment: #34 - LEKSHMI C. PILLAI  - 2017-01-24 00:26:22 ==
  Hi

  The fix worked with the 4.10.0-rc3fixlifetime+ kernel. We need to know
  which kernel the fix is going to be in, and whether we can get a
  workaround for 16.04.02, i.e. kernel 4.8.

  
  Thanks
  Lekshmi

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1659111/+subscriptions