group.of.nepali.translators team mailing list archive

[Bug 1791790] Re: Kernel hang on drive pull caused by incomplete backport for bug 1597908

To: group.of.nepali.translators@xxxxxxxxxxxxxxxxxxx
From: Joseph Salisbury <joseph.salisbury@xxxxxxxxxxxxx>
Date: Wed, 12 Sep 2018 19:17:46 -0000
Reply-to: Bug 1791790 <1791790@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: linux (Ubuntu)
       Status: Confirmed => In Progress

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Xenial)
       Status: New => In Progress

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
     Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1791790

Title:
  Kernel hang on drive pull caused by incomplete backport for bug
  1597908

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  In Progress

Bug description:
  A bug was introduced when the fix for
  http://bugs.launchpad.net/bugs/1597908 was backported. The bug exists
  in all Ubuntu 16.04 LTS 4.4 kernels >= 4.4.0-36, and in many non-LTS
  kernels.

  The backported patch changes the context in which timeout work is
  scheduled for block devices in the kernel. Previously, timeout work
  was executed directly from the timer callback that fired when a
  deadline was met. After the patch, timeout work is scheduled on a
  background work queue. This means that by the time the work executes,
  the device queue that originally scheduled it may already have been
  torn down. To prevent this, the patch takes a reference on the device
  queue when executing the timeout work.

  The problem is that the last reference to this queue can be dropped
  before the timeout work runs. During teardown, the block layer
  executes a freeze followed by a drain. The freeze drops the last
  reference on the queue. The drain tries to clean up any outstanding
  work, including timeout work. After a freeze, the timeout work in the
  background queue is unable to obtain a reference, and exits early
  without completing its work. The work is now permanently stuck in the
  queue and will never be completed, so the drain in the device
  teardown path spins indefinitely.

  The bug manifests as a hang that looks like this:
  [<ffffffff81829f15>] schedule+0x35/0x80
  [<ffffffffc014aea9>] hpsa_scan_start+0x109/0x140 [hpsa]
  [<ffffffff810c3cb0>] ? wake_atomic_t_function+0x60/0x60
  [<ffffffffc014b602>] hpsa_rescan_ctlr_worker+0x1d2/0x652 [hpsa]
  [<ffffffff8109a2c5>] process_one_work+0x165/0x480
  [<ffffffff8109a62b>] worker_thread+0x4b/0x4c0
  [<ffffffff8109a5e0>] ? process_one_work+0x480/0x480
  [<ffffffff810a0808>] kthread+0xd8/0xf0
  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
  [<ffffffff8182e38f>] ret_from_fork+0x3f/0x70
  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0

  The fix exists upstream. It applies, builds, and runs cleanly on
  Ubuntu's most recent 4.4 kernel:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=4e9b6f20828ac880dbc1fa2fdbafae779473d1af

  We hit this bug nearly 100% of the time on some of our HP hardware.
  The HPSA driver tends to remove missing devices aggressively, which
  widens the race window. As a result, we've been building our own
  kernel with this patch applied. It would be really nice if we could
  get it into the mainline Ubuntu kernel.

  Let me know what additional information is needed. Thanks!

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791790/+subscriptions