group.of.nepali.translators team mailing list archive
Message #24140
[Bug 1775235] Re: Ubuntu 16.04 (4.4.0-127) hangs on boot with virtio-scsi MQ enabled
I built a test kernel with the patch you posted in the description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1775235
Can you test this kernel and see if it resolves this bug?
Note about installing test kernels:
• If the test kernel is older than 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
Thanks in advance!
** Changed in: linux (Ubuntu Xenial)
Status: Confirmed => In Progress
** Changed in: linux (Ubuntu)
Status: Confirmed => In Progress
** Changed in: linux (Ubuntu Xenial)
Assignee: (unassigned) => Joseph Salisbury (jsalisbury)
** Changed in: linux (Ubuntu)
Status: In Progress => Invalid
--
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1775235
Title:
Ubuntu 16.04 (4.4.0-127) hangs on boot with virtio-scsi MQ enabled
Status in linux package in Ubuntu:
Invalid
Status in linux source package in Xenial:
In Progress
Bug description:
We noticed that Ubuntu 16.04 guests running on Nutanix AHV stopped
booting after they were upgraded to the latest kernel (4.4.0-127).
Only guests with SCSI MQ enabled suffered from this problem. AHV is
one of the few hypervisor products to offer multiqueue for virtio-scsi
devices.
Upon further investigation, we could see that the kernel would hang
during the scanning of SCSI targets, specifically immediately after
coming across a target without any LUNs present. That is the first
time the kernel destroys a target (because it has no LUNs).
This could be confirmed with gdb (attached to qemu's gdbserver):
#0 0xffffffffc0045039 in ?? ()
#1 0xffff88022c753c98 in ?? ()
#2 0xffffffff815d1de6 in scsi_target_destroy (starget=0xffff88022ad62400)
at /build/linux-E14mqW/linux-4.4.0/drivers/scsi/scsi_scan.c:322
This shows the guest vCPU stuck on virtio-scsi's implementation of
target_destroy. Despite lacking symbols, we managed to examine the
virtio_scsi_target_state to see that the 'reqs' counter was invalid:
(gdb) p *(struct virtio_scsi_target_state *)starget->hostdata
$6 = {tgt_seq = {sequence = 0}, reqs = {counter = -1}, req_vq = 0xffff88022cbdd9e8}
(gdb)
This drew our attention to the following patch which is exclusive to the Ubuntu kernel:
commit f1f609d8015e1d34d39458924dcd9524fccd4307
Author: Jay Vosburgh <jay.vosburgh@xxxxxxxxxxxxx>
Date: Thu Apr 19 21:40:00 2018 +0200
In a nutshell, the patch spins on the target's 'reqs' counter waiting for the target to quiesce:
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -785,6 +785,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
         struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+        /* we can race with concurrent virtscsi_complete_cmd */
+        while (atomic_read(&tgt->reqs))
+                cpu_relax();
         kfree(tgt);
 }
Personally, I think this is a catastrophic way of waiting for a target
to quiesce since virtscsi_target_destroy() is called with IRQs
disabled from scsi_scan.c:scsi_target_destroy(). Devices which take a
long time to quiesce during a target_destroy() could hog the CPU for
relatively long periods of time.
Nevertheless, further study revealed that virtio-scsi itself is broken
in that it doesn't increment the 'reqs' counter when submitting
requests on MQ under certain conditions. That caused the counter to
drop to -1 (on the completion of the first request) and the CPU to
hang indefinitely.
The following patch fixes the issue:
--- old/linux-4.4.0/drivers/scsi/virtio_scsi.c 2018-06-04 10:23:07.000000000 -0700
+++ new/linux-4.4.0/drivers/scsi/virtio_scsi.c 2018-06-05 10:03:29.083428545 -0700
@@ -641,9 +641,10 @@
                                 scsi_target(sc->device)->hostdata;
         struct virtio_scsi_vq *req_vq;
-        if (shost_use_blk_mq(sh))
+        if (shost_use_blk_mq(sh)) {
                 req_vq = virtscsi_pick_vq_mq(vscsi, sc);
-        else
+                atomic_inc(&tgt->reqs);
+        } else
                 req_vq = virtscsi_pick_vq(vscsi, tgt);
         return virtscsi_queuecommand(vscsi, req_vq, sc);
Signed-off-by: Felipe Franciosi <felipe@xxxxxxxxxxx>
Please consider this an urgent fix, as all of our customers who use
Ubuntu 16.04 and have MQ enabled for better performance will be
affected by your latest update. Our workaround is to recommend that
they disable SCSI MQ while you work on the issue.
Best regards,
Felipe
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1775235/+subscriptions