
[Bug 1833319] [NEW] Performance degradation when copying from LVM snapshot backed by NVMe disk

 

Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/1833319

[Impact]
When copying files from a mounted LVM snapshot which resides on an NVMe storage device, there is a massive performance degradation in the rate at which sectors are read from the disk.

The kernel is not merging sector requests; instead, it issues many small
sector requests to the NVMe storage controller rather than one larger request.

Experiments have shown a 14x-25x performance degradation in reads:
copies which used to take seconds now take minutes, and copies which
took thirty minutes now take many hours.

[Fix]

The following was found with btrace, running alongside cat (see
[Testcase]):

Standard lvm copy:
$ cat /mnt/dummy1 1> /dev/null
LVM snapshot copy:
$ cat /tmp/mount.backup_OXV/dummy2 1> /dev/null

Tracing:
# btrace /dev/nvme1n1 > trace.data
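
For reference, a minimal way to capture each trace (a sketch assuming
the btrace utility from the blktrace package and the mounts described
in [Testcase]) is to run the tracer in the background while the copy
runs, then stop it once the copy finishes:

# btrace /dev/nvme1n1 > trace.data &
# cat /mnt/dummy1 1> /dev/null
# kill %1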

Looking at the "control" case of copying from /mnt, which is the
standard LVM volume, we see a trace like:

259,0    1       13     0.002545516  1579  A   R 280576 + 512 <- (252,0) 278528
259,0    1       14     0.002545701  1579  Q   R 280576 + 512 [cat]
259,0    1       15     0.002547020  1579  G   R 280576 + 512 [cat]
259,0    1       16     0.002547631  1579  U   N [cat] 1
259,0    1       17     0.002547775  1579  I  RS 280576 + 512 [cat]
259,0    1       18     0.002551381  1579  D  RS 280576 + 512 [cat]
259,0    1       19     0.004099666     0  C  RS 280576 + 512 [0]

A = IO remapped to different device
Q = IO handled by request queue
G = Get request
U = Unplug request
I = IO inserted onto request queue
D = IO issued to driver
C = IO completion

Firstly, the request is remapped from a different device: from /mnt,
which is dm-1, to the NVMe disk. A 512-sector read is placed on the IO
request queue, inserted into the driver request queue, issued to the
driver to fetch the data, and then completes.

Now, when reading from the LVM snapshot, we see:

259,0    1      113     0.001117160  1606  A   R 837872 + 8 <- (252,0) 835824
259,0    1      114     0.001117276  1606  Q   R 837872 + 8 [cat]
259,0    1      115     0.001117451  1606  G   R 837872 + 8 [cat]
259,0    1      116     0.001117979  1606  A   R 837880 + 8 <- (252,0) 835832
259,0    1      117     0.001118119  1606  Q   R 837880 + 8 [cat]
259,0    1      118     0.001118285  1606  G   R 837880 + 8 [cat]
259,0    1      122     0.001121613  1606  I  RS 837640 + 8 [cat]
259,0    1      123     0.001121687  1606  I  RS 837648 + 8 [cat]
259,0    1      124     0.001121758  1606  I  RS 837656 + 8 [cat]
...
259,0    1      154     0.001126118   377  D  RS 837648 + 8 [kworker/1:1H]
259,0    1      155     0.001126445   377  D  RS 837656 + 8 [kworker/1:1H]
259,0    1      156     0.001126871   377  D  RS 837664 + 8 [kworker/1:1H]
...
259,0    1      183     0.001848512     0  C  RS 837632 + 8 [0]

What is happening here is that each 8-sector read request is placed
onto the IO request queue, inserted one at a time into the driver
request queue, and then fetched by the driver.

Comparing this behaviour to reading data from an LVM snapshot on 4.6+
mainline or the Ubuntu 4.15 HWE kernel:

M = IO back merged with request on queue

259,0    0      194     0.000532515  1897  A   R 7358960 + 8 <- (253,0) 7356912
259,0    0      195     0.000532634  1897  Q   R 7358960 + 8 [cat]
259,0    0      196     0.000532810  1897  M   R 7358960 + 8 [cat]
259,0    0      197     0.000533864  1897  A   R 7358968 + 8 <- (253,0) 7356920
259,0    0      198     0.000533991  1897  Q   R 7358968 + 8 [cat]
259,0    0      199     0.000534177  1897  M   R 7358968 + 8 [cat]
259,0    0      200     0.000534474  1897 UT   N [cat] 1
259,0    0      201     0.000534586  1897  I   R 7358464 + 512 [cat]
259,0    0      202     0.000537055  1897  D   R 7358464 + 512 [cat]
259,0    0      203     0.002242539     0  C   R 7358464 + 512 [0]

This shows an 8-sector read added to the request queue, which is then
[M]erged backward with other requests on the queue until the sum of the
merged requests reaches 512 sectors. From there, the 512-sector read is
placed onto the IO queue, fetched by the device driver, and completes.
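
To put those request sizes in perspective (blktrace reports in 512-byte
sectors), the unmerged and merged requests work out to:

$ echo "$((8 * 512)) bytes per unmerged request"     # 4096 bytes (4 KiB)
$ echo "$((512 * 512)) bytes per merged request"     # 262144 bytes (256 KiB)

That is 64 times fewer commands sent to the NVMe controller for the
same amount of data.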

The problem is that the 4.4 xenial kernel is not merging 8 sector
requests.

I came across this bugzilla entry,

https://bugzilla.kernel.org/show_bug.cgi?id=117051

and we see that merging is controlled by a sysfs entry,
/sys/block/nvme1n1/queue/nomerges

On 4.4 xenial, reading from this yields 2, i.e. QUEUE_FLAG_NOMERGES.
On the 4.6+ and 4.15 HWE kernels, it yields 0, meaning merges are allowed.

Setting this to 0 on the 4.4 kernel with:

# echo "0" > /sys/block/nvme1n1/queue/nomerges

and testing again, we find performance is restored and the problem is
fixed.
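
As an interim workaround on unpatched kernels, the same setting can be
applied to every NVMe queue on the system (a sketch; the setting does
not persist across reboots):

# for q in /sys/block/nvme*/queue/nomerges; do echo 0 > "$q"; done
# grep -H . /sys/block/nvme*/queue/nomerges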

Looking at the trace with btrace, we see that it performs 8-sector
reads, which get back-merged into a 512-sector read issued in one go.
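
One quick way to confirm this from the captured trace data (assuming
the default btrace output shown above, where the action character is
the sixth field) is to count merge and dispatch events before and after
the change:

$ awk '$6 == "M"' trace.data | wc -l     # back merges: 0 on an affected kernel
$ awk '$6 == "D"' trace.data | wc -l     # requests dispatched to the driver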

Looking into the kernel tree with cscope for QUEUE_FLAG_NOMERGES, we
come across:

commit ef2d4615c59efb312e531a5e949970f37ca1c841
Author: Keith Busch <keith.busch@xxxxxxxxx>
Date:   Thu Feb 11 13:05:40 2016 -0700
Subject: NVMe: Allow request merges

This commit stops QUEUE_FLAG_NOMERGES from being set during driver
init, allowing requests to be back-merged. It also has the direct
effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0.

Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4
kernels.
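
For anyone verifying the pick, one way to confirm whether a given tree
already carries the commit (a sketch against an upstream clone) is:

$ git log --oneline --grep 'NVMe: Allow request merges'
$ git describe --contains ef2d4615c59efb312e531a5e949970f37ca1c841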

[Testcase]

You can replicate the problem on a system with an NVMe disk. I
recommend using a c5.large AWS EC2 instance with a secondary gp2 EBS
disk of 200GB or larger.

Steps (with NVMe disk being /dev/nvme1n1):
  1. sudo pvcreate /dev/nvme1n1
  2. sudo vgcreate secvol /dev/nvme1n1
  3. sudo lvcreate --name seclv -l 80%FREE secvol
  4. sudo mkfs.ext4 /dev/secvol/seclv
  5. sudo mount /dev/mapper/secvol-seclv /mnt
  6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done
  7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE'
  8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX)
  9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT

To replicate, simply read one of those 512MB files:
  10. time cat $NEWMOUNT/dummy1 1> /dev/null
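
For a direct comparison on the same kernel, the same read can be timed
against both the origin volume and the snapshot (a sketch using the
mounts created above; dropping the page cache first forces the reads to
hit the disk):

# echo 3 > /proc/sys/vm/drop_caches
$ time cat /mnt/dummy1 1> /dev/null
$ time cat $NEWMOUNT/dummy1 1> /dev/null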

On a stock xenial kernel, expect to see the following:

4.4.0-151-generic #178-Ubuntu

$ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null

real	0m42.693s
user	0m0.008s
sys	0m0.388s
$ cat /sys/block/nvme1n1/queue/nomerges
2

On a patched xenial kernel, performance is restored:

4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu

$ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null

real	0m1.773s
user	0m0.008s
sys	0m0.184s
$ cat /sys/block/nvme1n1/queue/nomerges
0
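
The ratio of the two timings above lines up with the degradation quoted
in [Impact]:

$ echo "scale=1; 42.693 / 1.773" | bc    # roughly 24x slower unpatched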

[Regression Potential]

Cherry picking "NVMe: Allow request merges" changes the default request
merge policy for NVMe drives, which may give some cause for concern in
terms of both stability and performance for other workloads.

Regarding stability, this flag was originally set when the NVMe driver was
bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here:

https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html

Along with the commit "MD: make bio mergeable", there is no reason not
to allow requests to be merged for the new NVMe driver.

Regarding performance for other workloads, I reference the commit in
which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced:
commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c
subject: block: Added in stricter no merge semantics for block I/O

nomerges        Throughput      %System         Improvement (tput / %sys)
--------        ------------    -----------     -------------------------
0               12.45 MB/sec    0.669365609
1               12.50 MB/sec    0.641519199     0.40% / 2.71%
2               12.52 MB/sec    0.639849750     0.56% / 2.96%

It shows only a 0.56% throughput increase for no merging (nomerges == 2)
over allowing merging (nomerges == 0) for random IO workloads.

Comparing this with the 14x-25x performance degradation for reads where
requests cannot be merged, it is clear that changing the default to 0
will not impact other workloads by any significant margin.

The commit is present in Linux 4.5 mainline, can be cleanly cherry
picked, and is still present in the kernel to this day; after reviewing
the NVMe driver, I believe there will be no regressions.

If you are interested in testing, I have prepared two PPAs with
ef2d4615c59efb312e531a5e949970f37ca1c841 applied:

linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic
linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Xenial)
     Importance: Undecided
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: sts

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Tags added: sts

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1833319

Title:
  Performance degradation when copying from LVM snapshot backed by NVMe
  disk

Status in linux package in Ubuntu:
  New
Status in linux source package in Xenial:
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1833319/+subscriptions

