kernel-packages team mailing list archive

[Bug 897421] Re: cannot unfreeze filesystem due to a deadlock due to multipath failover

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: "Christopher M. Penalver" <christopher.m.penalver@xxxxxxxxx>
Date: Sun, 18 Aug 2013 20:39:30 -0000
Reply-to: Bug 897421 <897421@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
** Tags added: needs-kernel-logs needs-upstream-testing

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/897421

Title:
  cannot unfreeze filesystem due to a deadlock due to multipath failover

Status in “linux” package in Ubuntu:
  Confirmed
Status in “linux” source package in Oneiric:
  Invalid
Status in “linux” source package in Precise:
  Confirmed

Bug description:
  To reproduce:
  - 2 or more servers using shared storage (SAN)
  - Each loaded with iozone on attached data luns (/s1, /s2, /s3)
   iozone -R -l 2 -u 2 -r 4k -s 100m
  - Each system has three data luns of 10G, the root filesystem is not stressed
  - A failover injected every 6 mins (this can happen on the first failover)
  - dmesg -n 8 as root from serial consoles on all systems
  - kdump configured
  - set sysctl kernel.hung_task_panic = 1

  The hang occurs after a broken path comes back, regardless of whether
  the HBA enters error handling.

  In simplest terms, the OS via UDEV is recreating the once broken path
  by instantiating block devices and creating symlinks. To do this it runs
  the following udev rule: /lib/udev/rules.d/95-kpartx.rules

  # Create dm tables for partitions
  ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
          RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"

  This runs kpartx, which acquires the s_umount semaphore for serialization
  and freezes the block device, and thus the filesystem, while it adds
  partitions to the root block device.

  At the same time, a flush thread for the block device in question begins
  to write back dirty pages. It acquires the s_umount semaphore before
  kpartx does, and then sleeps waiting for the block device to become
  unfrozen.

  Since kpartx is trying to obtain a write_lock on the s_umount semaphore, and
  the flush thread is already asleep holding a read_lock on s_umount, kpartx
  can never enter the critical section to unfreeze the block device. Since the
  flush thread is also sleeping on the condition of the block device being
  unfrozen, it is also deadlocked.
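
  The locking rule behind this can be shown with a toy model. The sketch
  below is not kernel code: toy_rwsem and its helper functions are
  hypothetical stand-ins for a reader/writer semaphore such as s_umount,
  and they only model the rule that a writer cannot get in while any
  reader holds the lock.

```c
#include <stdbool.h>

/* Toy stand-in for a reader/writer semaphore such as s_umount.
 * Hypothetical model, not the kernel's struct rw_semaphore. */
struct toy_rwsem {
    int readers;   /* number of current read holders */
    bool writer;   /* true while a writer holds the lock */
};

/* Non-blocking read acquire: succeeds unless a writer holds the lock. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->writer)
        return false;
    sem->readers++;
    return true;
}

/* Non-blocking write acquire: succeeds only when nobody holds the lock. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

/* Drop one read hold. */
static void toy_up_read(struct toy_rwsem *sem)
{
    sem->readers--;
}

/* True if a writer (kpartx in thaw_bdev) is shut out while a reader
 * (the flush thread) still holds the semaphore. */
static bool writer_blocked_by_reader(void)
{
    struct toy_rwsem s_umount = { 0, false };
    bool flush_got_read = toy_down_read_trylock(&s_umount);
    bool kpartx_got_write = toy_down_write_trylock(&s_umount);
    return flush_got_read && !kpartx_got_write;
}

/* True if the writer gets in once the reader has let go. */
static bool writer_proceeds_after_release(void)
{
    struct toy_rwsem s_umount = { 0, false };
    if (!toy_down_read_trylock(&s_umount))
        return false;
    toy_up_read(&s_umount);
    return toy_down_write_trylock(&s_umount);
}
```

  In this model the writer only succeeds after the reader releases, which
  in the reported hang never happens because the reader is asleep.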

  Root Cause:
  After exhausting the down_write instrumentation and not finding any
  other instances competing for the down_write, I changed focus to the
  primary hung thread, the writeback flush.

  Back to the kpartx hang:
  thaw_bdev...

          down_write(&sb->s_umount); <== hang here

          if (sb->s_flags & MS_RDONLY)
                  goto out_unfrozen;

          if (sb->s_op->unfreeze_fs) {
                  error = sb->s_op->unfreeze_fs(sb);
                  if (error) {
                          printk(KERN_ERR
                                  "VFS:Filesystem thaw failed\n");
                          sb->s_frozen = SB_FREEZE_TRANS;
                          bdev->bd_fsfreeze_count++;
                          mutex_unlock(&bdev->bd_fsfreeze_mutex);
                          return error;
                  }
          }

  out_unfrozen:
          sb->s_frozen = SB_UNFROZEN;
          smp_wmb();
          wake_up(&sb->s_wait_unfrozen);

  Were we to successfully exit, we change the superblock to unfrozen.
  However the flush thread is sleeping, waiting for the super_block
  to become unfrozen.

  int ext4_force_commit(struct super_block *sb)
  {
          journal_t *journal;
          int ret = 0;

          if (sb->s_flags & MS_RDONLY)
                  return 0;

          journal = EXT4_SB(sb)->s_journal;
          if (journal) {
                  vfs_check_frozen(sb, SB_FREEZE_TRANS); <=== this is where it sleeps
                  ret = ext4_journal_force_commit(journal);
          }

          return ret;
  }

  enum {
          SB_UNFROZEN = 0,
          SB_FREEZE_WRITE = 1,
          SB_FREEZE_TRANS = 2,
  };

  #define vfs_check_frozen(sb, level) \
          wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

  crash-5.0> super_block.s_frozen ffff880268a4e000
    s_frozen = 0x2,
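
  To make the wait condition concrete, here is a small sketch that
  evaluates the same comparison vfs_check_frozen() uses. The helper name
  must_wait_for_thaw is hypothetical; the enum values are the ones quoted
  above. With s_frozen = 0x2 and level = SB_FREEZE_TRANS, the test
  s_frozen < level is false, so the caller keeps sleeping.

```c
#include <stdbool.h>

/* Freeze levels copied from the enum quoted in the report. */
enum {
    SB_UNFROZEN = 0,
    SB_FREEZE_WRITE = 1,
    SB_FREEZE_TRANS = 2,
};

/* Mirrors the wake-up condition of vfs_check_frozen(): the caller may
 * continue only when s_frozen < level. Returns true if it must sleep.
 * Hypothetical helper, for illustration only. */
static bool must_wait_for_thaw(int s_frozen, int level)
{
    return !(s_frozen < level);
}
```

  Calling must_wait_for_thaw(0x2, SB_FREEZE_TRANS) reports that the flush
  thread has to wait, while SB_UNFROZEN would let it continue.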

  So why can't thaw_bdev make any forward progress? There's a reader
  out there, that's holding the s_umount sema somewhere in this call
  stack.

  PID: 992 TASK: ffff8802678a8000 CPU: 7 COMMAND: "flush-251:5"
   #0 [ffff880267bddb00] schedule at ffffffff8158bcbd
   #1 [ffff880267bddbb8] ext4_force_commit at ffffffff8120b16d
   #2 [ffff880267bddc18] ext4_write_inode at ffffffff811f29e5
   #3 [ffff880267bddc68] writeback_single_inode at ffffffff81178964
   #4 [ffff880267bddcb8] writeback_sb_inodes at ffffffff81178f09
   #5 [ffff880267bddd18] wb_writeback at ffffffff8117995c
  (down_read(sb->s_umount) taken here)

   #6 [ffff880267bdddc8] wb_do_writeback at ffffffff81179b6b
   #7 [ffff880267bdde58] bdi_writeback_task at ffffffff81179cc3
   #8 [ffff880267bdde98] bdi_start_fn at ffffffff8111e816
   #9 [ffff880267bddec8] kthread at ffffffff81088a06
  #10 [ffff880267bddf48] kernel_thread at ffffffff810142ea

  and as long as there's an active reader, the writer can't
  change anything. After some dissection the likely culprit is
  in frame #5

  (We must have gotten here through writeback_inodes_wb)

  void writeback_inodes_wb(struct bdi_writeback *wb,
                  struct writeback_control *wbc)
  {
          int ret = 0;

          wbc->wb_start = jiffies; /* livelock avoidance */
          spin_lock(&inode_lock);
          if (!wbc->for_kupdate || list_empty(&wb->b_io))
                  queue_io(wb, wbc->older_than_this);

          while (!list_empty(&wb->b_io)) {
                  struct inode *inode = list_entry(wb->b_io.prev,
                                  struct inode, i_list);
                  struct super_block *sb = inode->i_sb;

  !!! This is where the down_read is taken !!!

                  if (!pin_sb_for_writeback(sb)) { <== performs down_read_trylock on s_umount
                          requeue_io(inode);
                          continue;
                  }
                  ret = writeback_sb_inodes(sb, wb, wbc, false);

  You must have successfully grabbed s_umount for reading before
  reaching this point.

  Thus the deadlock: the flush thread will wait to be unfrozen forever
  because it's sleeping with a read lock on s_umount, which prevents
  the write lock from making any forward progress in thaw_bdev. So
  s_frozen will never be set to SB_UNFROZEN, which is what would trigger
  the waitq and allow the flush to complete.
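
  The circular wait can be checked with a small state model. This is only
  a sketch under simplifying assumptions: struct model, step and
  is_deadlocked are hypothetical names, and each task is reduced to the
  single condition it is blocked on in the analysis above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the two tasks and the state they share. */
struct model {
    bool flush_holds_read;   /* flush thread holds s_umount for reading */
    bool frozen;             /* s_frozen is still SB_FREEZE_TRANS */
    bool flush_done;         /* flush thread finished its writeback */
    bool thaw_done;          /* thaw_bdev finished */
};

/* Try to advance each task by one step; returns true if anything moved. */
static bool step(struct model *m)
{
    bool progressed = false;

    /* thaw_bdev: down_write(s_umount) needs zero readers; once it has
     * the lock it marks the superblock unfrozen. */
    if (!m->thaw_done && !m->flush_holds_read) {
        m->frozen = false;
        m->thaw_done = true;
        progressed = true;
    }

    /* flush: sleeps in vfs_check_frozen() until the superblock is
     * unfrozen, then finishes and drops its read lock. */
    if (!m->flush_done && !m->frozen) {
        m->flush_done = true;
        m->flush_holds_read = false;
        progressed = true;
    }
    return progressed;
}

/* Run to a fixed point; true means at least one task is stuck forever. */
static bool is_deadlocked(struct model m)
{
    while (step(&m))
        ;
    return !(m.flush_done && m.thaw_done);
}

/* The reported state: flush holds the read lock and the fs is frozen. */
static bool reported_case_deadlocks(void)
{
    struct model m = { true, true, false, false };
    return is_deadlocked(m);
}

/* Same state, but the flush thread never took the read lock. */
static bool without_read_lock_completes(void)
{
    struct model m = { false, true, false, false };
    return !is_deadlocked(m);
}
```

  In the model, the reported state never makes progress, while the same
  state without the read hold lets thaw_bdev finish first and the flush
  complete afterwards.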

  This signature is identical to an issue just reported on the fs-dev
  lists 14 days ago. There's also a test case of applying a simple
  "sync" in a loop, which adds more credibility to the failover
  hanging on the SCM that isn't under load. It's not the traffic
  that's the issue; it's the writeback that was forced, coupled
  with the freeze/thaw action thanks to udev, that gives us the
  conditions for the deadlock.

  http://66.135.57.166/lists/linux-fsdevel/msg42068.html

  Because it's really related to "sync" it doesn't matter what filesystem
  you use.

  The proposed solution is against 2.6.38. I've tried parts of it
  already with no success. The full patch will require dramatic
  changes to the superblock just for it to apply, which of course
  could pose even more issues. We can finally say, however, that the
  root cause has been identified.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/897421/+subscriptions