kernel-packages team mailing list archive

[Bug 1382801] Re: XFS: mount hangs for corrupted filesystem

 

========
ANALYSIS
========

#0 [ffff880036a73b38] schedule at ffffffff8175e320
#1 [ffff880036a73bc0] xfs_ail_push_all_sync at ffffffffa02e5478 [xfs]
#2 [ffff880036a73c30] xfs_log_quiesce at ffffffffa02e0b67 [xfs]
#3 [ffff880036a73c50] xfs_log_unmount at ffffffffa02e0bb6 [xfs]
#4 [ffff880036a73c70] xfs_mountfs at ffffffffa029332a [xfs]

Analyzing the stack trace, we can see that, during xfs_fs_mount, the only
way to reach "xfs_log_unmount" is through an error path in xfs_mountfs(),
which means the XFS filesystem was found to be corrupted:

xfs_mountfs(
    xfs_mount_t *mp)
{
    xfs_sb_t *sbp = &(mp->m_sb);
    xfs_inode_t *rip;
...
    /*
     * Get and sanity-check the root inode.
     * Save the pointer to it in the mount structure.
     */
    error = xfs_iget(mp, NULL, sbp->sb_rootino, 0, XFS_ILOCK_EXCL, &rip);
    if (error) {
        xfs_warn(mp, "failed to read root inode");
        goto out_log_dealloc;
    }

...
 out_log_dealloc:
    xfs_log_unmount(mp);
...

So we DO KNOW your XFS filesystem is considered CORRUPTED: the XFS function
xfs_iget(), called on the root inode as a sanity check, returned an error.
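
The error path above can be mimicked in a small userspace sketch. Every name
below (mount_sketch, read_root_inode, log_unmount_sketch, mountfs_sketch) is a
hypothetical stand-in and not a kernel API, and the error value is chosen
arbitrarily. The sketch only shows how a failed root inode lookup jumps to a
cleanup label that tears down the already-mounted log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-mount state; NOT the kernel's struct xfs_mount. */
struct mount_sketch {
	bool root_inode_ok;     /* pretend outcome of reading the root inode */
	bool log_mounted;       /* true once the log has been set up */
	int  log_unmount_calls; /* how many times the cleanup ran */
};

/* Stand-in for xfs_iget() on the root inode; the error code is arbitrary. */
static int read_root_inode(const struct mount_sketch *mp)
{
	return mp->root_inode_ok ? 0 : -EIO;
}

/* Stand-in for xfs_log_unmount(); in the real trace the hang is below this call. */
static void log_unmount_sketch(struct mount_sketch *mp)
{
	mp->log_mounted = false;
	mp->log_unmount_calls++;
}

/* Same shape as the excerpt: on error, jump to the label that unmounts the log. */
static int mountfs_sketch(struct mount_sketch *mp)
{
	int error;

	mp->log_mounted = true; /* the log is set up before the root inode is read */

	error = read_root_inode(mp);
	if (error)
		goto out_log_dealloc;

	return 0;

out_log_dealloc:
	assert(mp->log_mounted); /* cleanup only makes sense for a mounted log */
	log_unmount_sketch(mp);
	return error;
}
```

In this sketch the cleanup runs exactly once for a bad root inode and never for
a good one, which matches the reasoning that seeing xfs_log_unmount in a mount
stack implies an earlier failure.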

Either way, let's continue debugging to understand why XFS did not give us
an error about the filesystem being corrupted:

Following the stack:

#2 [ffff880036a73c30] xfs_log_quiesce at ffffffffa02e0b67 [xfs]
#3 [ffff880036a73c50] xfs_log_unmount at ffffffffa02e0bb6 [xfs]

We can see that xfs_log_quiesce calls

#1 [ffff880036a73bc0] xfs_ail_push_all_sync at ffffffffa02e5478 [xfs]

This is the function responsible for pushing ALL items in the AIL (Active
Item List) to disk (for unmount and freeze purposes).

The body of this function is simple:

600     struct xfs_log_item *lip;
601     DEFINE_WAIT(wait);
602
603     spin_lock(&ailp->xa_lock);
604     while ((lip = xfs_ail_max(ailp)) != NULL) {
605         prepare_to_wait(&ailp->xa_empty, &wait, TASK_UNINTERRUPTIBLE);
606         ailp->xa_target = lip->li_lsn;
607         wake_up_process(ailp->xa_task);
608         spin_unlock(&ailp->xa_lock);
609         schedule();
610         spin_lock(&ailp->xa_lock);
611     }
612     spin_unlock(&ailp->xa_lock);
613
614     finish_wait(&ailp->xa_empty, &wait);

On each pass it takes the last "xfs_log_item" on the AIL doubly linked list
(the one with the highest LSN), sets its LSN as the push target, and wakes
the task responsible for committing these "log items" to disk:

607 wake_up_process(ailp->xa_task);
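
The behaviour of this loop can be illustrated with a single-threaded userspace
sketch. The names ail_sketch, wake_pusher_and_wait and push_all_sync_sketch are
hypothetical, and the AIL is reduced to a counter of pending items. The
max_wakeups bound is an addition of the sketch so that it terminates; the
kernel loop has no such bound, so it sleeps for as long as the list is not
empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified AIL: only the number of pending items matters. */
struct ail_sketch {
	size_t pending;          /* log items still on the list */
	size_t flushed_per_wake; /* items written back per wakeup; 0 models no progress */
};

/* Stand-in for waking xa_task and sleeping in schedule() until it has run once. */
static void wake_pusher_and_wait(struct ail_sketch *ailp)
{
	size_t n = ailp->flushed_per_wake;

	if (n > ailp->pending)
		n = ailp->pending;
	ailp->pending -= n;
}

/*
 * Same shape as the kernel loop: keep waking the pusher while items remain.
 * Returns the number of wakeups needed to empty the list, or -1 if the list
 * is still not empty after max_wakeups (a bound that exists only in this sketch).
 */
static long push_all_sync_sketch(struct ail_sketch *ailp, long max_wakeups)
{
	long wakeups = 0;

	assert(max_wakeups >= 0);
	while (ailp->pending != 0) {
		if (wakeups == max_wakeups)
			return -1;
		wake_pusher_and_wait(ailp);
		wakeups++;
	}
	return wakeups;
}
```

With 10 pending items and 3 items written back per wakeup the sketch finishes
after 4 wakeups; with 0 items written back per wakeup it never drains, which is
the situation in which the real caller would stay asleep indefinitely.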

So there are 2 possible things happening:

1) XFS is stuck inside this loop because of something happening in the
AIL (xa_task) callback function (responsible for committing XFS log items).
This, of course, makes the "mount" process hang in the "UNINTERRUPTIBLE"
state (since it's not safe to let userland kill this process).

OBS: We cannot continue analyzing because we lack the "core" file (which
would give us the stack trace of the kernel thread responsible for the
callback), despite our efforts to get it during the crisis.

2) XFS is stuck inside this loop because there are many log items
still waiting to be committed (as could happen in a stress test scenario).

OBS: We cannot continue because we lack the sosreport, which could show
us that the task has been blocked for more than XX seconds.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1382801

Title:
  XFS: mount hangs for corrupted filesystem

Status in “linux” package in Ubuntu:
  Confirmed

Bug description:
  This situation was brought to my attention:

  --------
  mount hangs at the following stack:
  crash> bt 2882
  PID: 2882 TASK: ffff88084e75c800 CPU: 7 COMMAND: "mount"
  #0 [ffff880036a73b38] schedule at ffffffff8175e320
  #1 [ffff880036a73bc0] xfs_ail_push_all_sync at ffffffffa02e5478 [xfs]
  #2 [ffff880036a73c30] xfs_log_quiesce at ffffffffa02e0b67 [xfs]
  #3 [ffff880036a73c50] xfs_log_unmount at ffffffffa02e0bb6 [xfs]
  #4 [ffff880036a73c70] xfs_mountfs at ffffffffa029332a [xfs]
  #5 [ffff880036a73ce0] xfs_fs_fill_super at ffffffffa0296707 [xfs]
  #6 [ffff880036a73d20] mount_bdev at ffffffff811cd4a9
  #7 [ffff880036a73db0] xfs_fs_mount at ffffffffa02946f5 [xfs]
  #8 [ffff880036a73dc0] mount_fs at ffffffff811ce123
  #9 [ffff880036a73e10] vfs_kern_mount at ffffffff811e9bf6
  #10 [ffff880036a73e60] do_new_mount at ffffffff811eb3a4
  #11 [ffff880036a73ec0] do_mount at ffffffff811ec706
  #12 [ffff880036a73f20] sys_mount at ffffffff811ecad0
  #13 [ffff880036a73f80] system_call_fastpath at ffffffff8176ae2d
  RIP: 00007f2340eb6c2a RSP: 00007fff25675368 RFLAGS: 00010206
  RAX: 00000000000000a5 RBX: ffffffff8176ae2d RCX: 0000000000000026
  RDX: 0000000000b04c20 RSI: 0000000000b04bf0 RDI: 0000000000b04bd0
  RBP: 00000000c0ed0400 R8: 0000000000b04c70 R9: 0000000000000001
  R10: ffffffffc0ed0400 R11: 0000000000000202 R12: 0000000000b04bf0
  R13: 0000000000b04b50 R14: 0000000000000400 R15: 0000000000000000
  ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b

  The corresponding disk is /dev/sdd1; any IO on it (xfs_check, etc.) also
  hangs in "D" state.

  This is reproducible with both the 3.11 and 3.13 kernels.

  The storage node is out of service because of this problem.
  --------

  I'm still asking for more data (sosreport and kernel dump).
