kernel-packages team mailing list archive

Thread
Date
[Bug 1349711] Re: Machine lockup in btrfs-transaction

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Peter Waller <p@xxxxxxxxxxx>
Date: Thu, 31 Jul 2014 14:31:29 -0000
Reply-to: Bug 1349711 <1349711@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
I've got a way to rapidly reproduce the error now. I can do it reliably
with a turnaround time of 5-10 minutes.

I've reproduced the crash on the new Kernel, so it has now been observed
on both 3.13.0-32-generic and  3.15.7-031507-generic. I'll try 3.16
next.

I've also discovered this new stack trace at the point of the crash
(setup_cluster_bitmap) which happens every time:

[97056.916053] NMI backtrace for cpu 1
[97056.916053] CPU: 1 PID: 1107 Comm: kworker/u30:1 Not tainted 3.13.0-32-generic #57-Ubuntu
[97056.916053] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/23/2014
[97056.916053] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
[97056.916053] task: ffff8800e8018000 ti: ffff8800da1ac000 task.ti: ffff8800da1ac000
[97056.916053] RIP: 0010:[<ffffffff81044a8c>] [<ffffffff81044a8c>] default_send_IPI_mask_sequence_phys+0xbc/0x100
[97056.916053] RSP: 0018:ffff8801efc23c60 EFLAGS: 00000046
[97056.916053] RAX: 0000000000000400 RBX: 000000000000b06a RCX: 0000000000000001
[97056.916053] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000300
[97056.916053] RBP: ffff8801efc23c98 R08: ffffffff81d13780 R09: 0000000000000d2e
[97056.916053] R10: 0000000000000000 R11: ffff8801efc239e6 R12: ffffffff81d13780
[97056.916053] R13: 0000000000000400 R14: 0000000000000086 R15: 0000000000000002
[97056.916053] FS: 0000000000000000(0000) GS:ffff8801efc20000(0000) knlGS:0000000000000000
[97056.916053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[97056.916053] CR2: 00007f803f12e000 CR3: 0000000101cc2000 CR4: 00000000001406e0
[97056.916053] Stack:
[97056.916053] 0000000100000008 0000000000000001 0000000000002710 ffffffff81c4e1c0
[97056.916053] ffffffff81d137a0 ffffffff81c4e1c0 0000000000000001 ffff8801efc23ca8
[97056.916053] ffffffff81049217 ffff8801efc23cc0 ffffffff81044c60 ffff8801efc2e800
[97056.916053] Call Trace:
[97056.916053] <IRQ>
[97056.916053] [<ffffffff81049217>] physflat_send_IPI_all+0x17/0x20
[97056.916053] [<ffffffff81044c60>] arch_trigger_all_cpu_backtrace+0x70/0xb0
[97056.916053] [<ffffffff810ca9ee>] rcu_check_callbacks+0x3fe/0x650
[97056.916053] [<ffffffff81076227>] update_process_times+0x47/0x70
[97056.916053] [<ffffffff810d5cf5>] tick_sched_handle.isra.17+0x25/0x60
[97056.916053] [<ffffffff810d5d71>] tick_sched_timer+0x41/0x60
[97056.916053] [<ffffffff8108e5e7>] __run_hrtimer+0x77/0x1d0
[97056.916053] [<ffffffff810d5d30>] ? tick_sched_handle.isra.17+0x60/0x60
[97056.916053] [<ffffffff8108edaf>] hrtimer_interrupt+0xef/0x230
[97056.916053] [<ffffffff81009fef>] xen_timer_interrupt+0x2f/0x150
[97056.916053] [<ffffffff81009ef0>] ? xen_clocksource_read+0x20/0x30
[97056.916053] [<ffffffff8101b7e9>] ? sched_clock+0x9/0x10
[97056.916053] [<ffffffff8109d1ad>] ? sched_clock_local+0x1d/0x80
[97056.916053] [<ffffffff810bf78e>] handle_irq_event_percpu+0x3e/0x1d0
[97056.916053] [<ffffffff810c2bbe>] handle_percpu_irq+0x3e/0x60
[97056.916053] [<ffffffff8142dba7>] __xen_evtchn_do_upcall+0x317/0x320
[97056.916053] [<ffffffff8109d98e>] ? __vtime_account_system+0x2e/0x40
[97056.916053] [<ffffffff8109dd2c>] ? vtime_account_system+0x3c/0x50
[97056.916053] [<ffffffff8142fc8b>] xen_evtchn_do_upcall+0x2b/0x50
[97056.916053] [<ffffffff8172e22d>] xen_hvm_callback_vector+0x6d/0x80
[97056.916053] <EOI>
[97056.916053] [<ffffffff813732df>] ? find_next_zero_bit+0x8f/0xf0
[97056.916053] [<ffffffffa0164f6a>] setup_cluster_bitmap+0x15a/0x280 [btrfs]
[97056.916053] [<ffffffffa0168174>] btrfs_find_space_cluster+0x244/0x290 [btrfs]
[97056.916053] [<ffffffffa0115a32>] find_free_extent+0x4d2/0xc30 [btrfs]
[97056.916053] [<ffffffffa01162b8>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
[97056.916053] [<ffffffffa012f165>] cow_file_range+0x135/0x430 [btrfs]
[97056.916053] [<ffffffffa0130142>] run_delalloc_range+0x312/0x350 [btrfs]
[97056.916053] [<ffffffffa0143f69>] ? find_lock_delalloc_range.constprop.43+0x1b9/0x1f0 [btrfs]
[97056.916053] [<ffffffffa0145134>] __extent_writepage+0x2f4/0x750 [btrfs]
[97056.916053] [<ffffffffa0144010>] ? end_extent_writepage+0x70/0x70 [btrfs]
[97056.916053] [<ffffffffa0145802>] extent_write_cache_pages.isra.30.constprop.50+0x272/0x3d0 [btrfs]
[97056.916053] [<ffffffffa0146acd>] extent_writepages+0x4d/0x70 [btrfs]
[97056.916053] [<ffffffffa012caf0>] ? btrfs_real_readdir+0x5b0/0x5b0 [btrfs]
[97056.916053] [<ffffffffa012ac88>] btrfs_writepages+0x28/0x30 [btrfs]
[97056.916053] [<ffffffff8115a5ee>] do_writepages+0x1e/0x40
[97056.916053] [<ffffffff811e5f40>] __writeback_single_inode+0x40/0x210
[97056.916053] [<ffffffff811e6cf7>] writeback_sb_inodes+0x247/0x3e0
[97056.916053] [<ffffffff811e6f2f>] __writeback_inodes_wb+0x9f/0xd0
[97056.916053] [<ffffffff811e71a3>] wb_writeback+0x243/0x2c0
[97056.916053] [<ffffffff811e8a79>] bdi_writeback_workfn+0x1b9/0x430
[97056.916053] [<ffffffff810838f2>] process_one_work+0x182/0x450
[97056.916053] [<ffffffff810846e1>] worker_thread+0x121/0x410
[97056.916053] [<ffffffff810845c0>] ? rescuer_thread+0x430/0x430
[97056.916053] [<ffffffff8108b3d2>] kthread+0xd2/0xf0
[97056.916053] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[97056.916053] [<ffffffff8172c5bc>] ret_from_fork+0x7c/0xb0
[97056.916053] [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[97056.916053] Code: 90 8b 0c 25 00 73 5f ff 80 e5 10 75 f2 89 f0 c1 e0 18 89 04 25 10 73 5f ff 41 83 ff 02 44 89 e8 41 0f 45 c7 89 04 25 00 73 5f ff <eb> 8a 4c 89 f7 57 9d 0f 1f 44 00 00 48 83 c4 10 5b 41 5c 41 5d

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1349711

Title:
  Machine lockup in btrfs-transaction

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  This has happened twice now.

  I'm on an AWS EC2 m3.large instance with the official Ubuntu AMI ami-
  776d9700.

  # cat /proc/version_signature
  Ubuntu 3.13.0-32.57-generic 3.13.11.4

  After running for many days, the machine locked up with the below
  messages appearing on the console. The machine would respond to ping
  but not SSH or HTTP requests. The machine has one BTRFS volume which
  is 87% full and lives on an Logical Volume Manager (LVM) block device
  on top of one Amazon Elastic Block Store (EBS) device.

  Error messages after first reboot:

  [   77.609490] BTRFS error (device dm-0): block group 10766778368 has wrong amount of free space
  [   77.613678] BTRFS error (device dm-0): failed to load free space cache for block group 10766778368
  [   77.643801] BTRFS error (device dm-0): block group 19356712960 has wrong amount of free space
  [   77.648952] BTRFS error (device dm-0): failed to load free space cache for block group 19356712960
  [   77.926325] BTRFS error (device dm-0): block group 20430454784 has wrong amount of free space
  [   77.931078] BTRFS error (device dm-0): failed to load free space cache for block group 20430454784
  [   78.111437] BTRFS error (device dm-0): block group 21504196608 has wrong amount of free space
  [   78.116165] BTRFS error (device dm-0): failed to load free space cache for block group 21504196608

  Error messages after second reboot:

  [   45.390221] BTRFS error (device dm-0): free space inode generation (0) did not match free space cache generation (70012)
  [   45.413472] BTRFS error (device dm-0): free space inode generation (0) did not match free space cache generation (70012)
  [  467.423961] BTRFS error (device dm-0): block group 518646661120 has wrong amount of free space
  [  467.429251] BTRFS error (device dm-0): failed to load free space cache for block group 518646661120

  Error messages on the console after second lock-up follow:

  [246736.752053] INFO: rcu_sched self-detected stall on CPU { 0}  (t=2220246 jiffies g=35399662 c=35399661 q=0)
  [246736.756059] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=2220247 jiffies, g=35399662, c=35399661, q=0)
  [246764.192014] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
  [246764.212058] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
  [246792.192022] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
  [246792.212057] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
  [246820.192052] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
  [246820.212018] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
  [246848.192052] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
  [246848.212058] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
  [246876.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
  [246876.212057] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
  [246904.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
  [246904.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
  [246916.772052] INFO: rcu_sched self-detected stall on CPU[246916.776058] INFO: rcu_sched detected stalls on CPUs/tasks:
  [246944.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
  [246944.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
  [246972.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
  [246972.212018] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
  [247000.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
  [247000.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
  [247028.192054] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
  [247028.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
  [247056.192053] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
  [247056.212061] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711/+subscriptions
References

[Bug 1349711] [NEW] Machine lockup in btrfs-transaction
From: Peter Waller, 2014-07-29