sts-sponsors team mailing list archive
Message #01822
[Bug 1876230] Re: liburcu: Enable MEMBARRIER_CMD_PRIVATE_EXPEDITED to address performance problems with MEMBARRIER_CMD_SHARED
Answering question 2. I have done a comprehensive performance analysis
based on the benchmark application.
Note: The SRU changes how the sys_membarrier syscall is used. The
implementation that we want to change to in this SRU never blocks, while
the previous implementation does. This makes performance analysis
entirely workload dependent. On busy servers with lots of background
processes, sys_membarrier will block more often than on quiet
servers with no background processes.
The following is based on a quiet server with no background processes.
Test parameters
===============
Ubuntu 18.04.4
KVM, 2 vcpus
0.10.1 liburcu
4.15.0-99-generic
Test program "test_urcu[_bp]": http://paste.ubuntu.com/p/5vXVycQjYk/
(only difference is #include <urcu.h> or #include <urcu-bp.h>)
No changes to source code
=========================
ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 6065490002 nr_writes 237 nr_ops 6065490239
nr_reads 6476219475 nr_writes 186 nr_ops 6476219661
nr_reads 6474789528 nr_writes 183 nr_ops 6474789711
nr_reads 6476326433 nr_writes 188 nr_ops 6476326621
nr_reads 6479298142 nr_writes 179 nr_ops 6479298321
nr_reads 6476429569 nr_writes 186 nr_ops 6476429755
nr_reads 6478019994 nr_writes 191 nr_ops 6478020185
nr_reads 6479117595 nr_writes 183 nr_ops 6479117778
nr_reads 6478302181 nr_writes 185 nr_ops 6478302366
nr_reads 6481003399 nr_writes 191 nr_ops 6481003590
ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 644339902 nr_writes 485 nr_ops 644340387
nr_reads 644092800 nr_writes 1101 nr_ops 644093901
nr_reads 644676446 nr_writes 494 nr_ops 644676940
nr_reads 643845915 nr_writes 500 nr_ops 643846415
nr_reads 645156053 nr_writes 502 nr_ops 645156555
nr_reads 644626421 nr_writes 497 nr_ops 644626918
nr_reads 644710679 nr_writes 495 nr_ops 644711174
nr_reads 644445530 nr_writes 503 nr_ops 644446033
nr_reads 645150707 nr_writes 497 nr_ops 645151204
nr_reads 643681268 nr_writes 496 nr_ops 643681764
Commits c0bb9f and 374530 patched in
====================================
ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 4097663510 nr_writes 6516 nr_ops 4097670026
nr_reads 4177088332 nr_writes 4183 nr_ops 4177092515
nr_reads 4153780077 nr_writes 1907 nr_ops 4153781984
nr_reads 4150954044 nr_writes 3942 nr_ops 4150957986
nr_reads 4267855073 nr_writes 2102 nr_ops 4267857175
nr_reads 4131310825 nr_writes 7119 nr_ops 4131317944
nr_reads 4183771431 nr_writes 1919 nr_ops 4183773350
nr_reads 4270944170 nr_writes 4958 nr_ops 4270949128
nr_reads 4123277225 nr_writes 4228 nr_ops 4123281453
nr_reads 4266997284 nr_writes 1723 nr_ops 4266999007
ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 6530208343 nr_writes 8860 nr_ops 6530217203
nr_reads 6514357222 nr_writes 10568 nr_ops 6514367790
nr_reads 6517420660 nr_writes 9534 nr_ops 6517430194
nr_reads 6510005433 nr_writes 11799 nr_ops 6510017232
nr_reads 6492226563 nr_writes 12517 nr_ops 6492239080
nr_reads 6532405460 nr_writes 6548 nr_ops 6532412008
nr_reads 6514205150 nr_writes 9686 nr_ops 6514214836
nr_reads 6481643486 nr_writes 16167 nr_ops 6481659653
nr_reads 6509268022 nr_writes 10582 nr_ops 6509278604
nr_reads 6523168701 nr_writes 9066 nr_ops 6523177767
Comparing and contrasting with 20.04:
=====================================
Test Parameters:
================
Ubuntu 20.04 LTS
KVM, 2 vcpus
0.11.1 liburcu
5.4.0-29-generic
ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 4270089636 nr_writes 1638 nr_ops 4270091274
nr_reads 4281598850 nr_writes 3008 nr_ops 4281601858
nr_reads 4241230576 nr_writes 3612 nr_ops 4241234188
nr_reads 4230643208 nr_writes 5367 nr_ops 4230648575
nr_reads 4333495124 nr_writes 1354 nr_ops 4333496478
nr_reads 4291295097 nr_writes 3545 nr_ops 4291298642
nr_reads 4232582737 nr_writes 1983 nr_ops 4232584720
nr_reads 4268926719 nr_writes 3363 nr_ops 4268930082
nr_reads 4266736459 nr_writes 4881 nr_ops 4266741340
nr_reads 4313525276 nr_writes 4549 nr_ops 4313529825
ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 6848011482 nr_writes 3171 nr_ops 6848014653
nr_reads 6842990129 nr_writes 4577 nr_ops 6842994706
nr_reads 6862298832 nr_writes 2875 nr_ops 6862301707
nr_reads 6849848255 nr_writes 4292 nr_ops 6849852547
nr_reads 6846387545 nr_writes 4975 nr_ops 6846392520
nr_reads 6860547626 nr_writes 3376 nr_ops 6860551002
nr_reads 6853028794 nr_writes 2784 nr_ops 6853031578
nr_reads 6846021299 nr_writes 3383 nr_ops 6846024682
nr_reads 6833359957 nr_writes 5917 nr_ops 6833365874
nr_reads 6851224193 nr_writes 2432 nr_ops 6851226625
Comparing and contrasting with 14.04:
=====================================
Test Parameters:
================
Ubuntu 14.04.6 LTS
KVM, 2 vcpus
0.7.12 liburcu
3.13.0-170-generic
ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu 6 2 10
nr_reads 284080749 nr_writes 790657 nr_ops 284871406
nr_reads 283785838 nr_writes 647058 nr_ops 284432896
nr_reads 273424217 nr_writes 1535098 nr_ops 274959315
nr_reads 283550711 nr_writes 1442548 nr_ops 284993259
nr_reads 282557773 nr_writes 946106 nr_ops 283503879
nr_reads 286811777 nr_writes 837176 nr_ops 287648953
nr_reads 273278986 nr_writes 1738549 nr_ops 275017535
nr_reads 287141686 nr_writes 652772 nr_ops 287794458
nr_reads 287697411 nr_writes 982440 nr_ops 288679851
nr_reads 281468419 nr_writes 830736 nr_ops 282299155
ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu_bp 6 2 10
nr_reads 670447719 nr_writes 16731 nr_ops 670464450
nr_reads 670464435 nr_writes 9970 nr_ops 670474405
nr_reads 670235233 nr_writes 4932 nr_ops 670240165
nr_reads 670853867 nr_writes 6845 nr_ops 670860712
nr_reads 670970962 nr_writes 307 nr_ops 670971269
nr_reads 670346111 nr_writes 8161 nr_ops 670354272
nr_reads 669748209 nr_writes 6824 nr_ops 669755033
nr_reads 671242419 nr_writes 249 nr_ops 671242668
nr_reads 670318007 nr_writes 8990 nr_ops 670326997
nr_reads 669872685 nr_writes 269 nr_ops 669872954
Analysis
========
Comparing the two Bionic tests, nr_ops for test_urcu goes from 6065490239
unpatched to 4097670026 patched. This is roughly a 1/3 drop, numbers
wise. However, the patched result is in line with what you would expect
if you were running Focal: 4097670026 vs 4270091274.
For test_urcu_bp, the two Bionic tests show a dramatic difference. We go
from 644340387 nr_ops unpatched to 6530217203 nr_ops patched, which is a
10x improvement. These numbers are in line with what you would expect on
Focal, with 6848014653 operations.
Compared to Trusty (roughly 2.8e8 nr_ops for test_urcu and 6.7e8 for
test_urcu_bp), patched Bionic is far faster in both variants.
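The ratios quoted in this analysis can be reproduced with a few lines of
Python. This is only a worked-arithmetic sketch: the dictionary names and
the ratio() helper are made up for illustration, and the figures are the
first-run nr_ops values copied from the transcripts above.

```python
# First-run nr_ops figures copied from the benchmark output above.
bionic_unpatched = {"test_urcu": 6065490239, "test_urcu_bp": 644340387}
bionic_patched = {"test_urcu": 4097670026, "test_urcu_bp": 6530217203}
focal = {"test_urcu": 4270091274, "test_urcu_bp": 6848014653}


def ratio(after, before):
    """Return how many times larger (or smaller) 'after' is than 'before'."""
    return after / before


for name in bionic_unpatched:
    change = ratio(bionic_patched[name], bionic_unpatched[name])
    vs_focal = ratio(bionic_patched[name], focal[name])
    print(f"{name}: patched/unpatched = {change:.2f}x, "
          f"patched/focal = {vs_focal:.2f}x")
```

A single run per configuration is a small sample, so these ratios are
indicative rather than precise.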
The next question is whether this benchmark is an appropriate
demonstration of performance. Since the SRU is about changing the
sys_membarrier syscall command options, we should really be profiling
the syscall itself, as this better indicates performance in real
workloads. Since we block on sys_membarrier in the unpatched version, we
would expect the syscall to be invoked less often there.
Perf Performance Analysis on "sys_enter_membarrier" Tracepoint
==============================================================
No changes to source code
=========================
# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads 5641721906 nr_writes 932 nr_ops 5641722838
607 syscalls:sys_enter_membarrier
nr_reads 6168632959 nr_writes 248 nr_ops 6168633207
595 syscalls:sys_enter_membarrier
nr_reads 6481069225 nr_writes 185 nr_ops 6481069410
567 syscalls:sys_enter_membarrier
# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads 644124499 nr_writes 501 nr_ops 644125000
1 syscalls:sys_enter_membarrier
nr_reads 646275413 nr_writes 2287 nr_ops 646277700
1 syscalls:sys_enter_membarrier
nr_reads 644021303 nr_writes 494 nr_ops 644021797
1 syscalls:sys_enter_membarrier
Commits c0bb9f and 374530 patched in
====================================
# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads 4322995476 nr_writes 3320 nr_ops 4322998796
835874 syscalls:sys_enter_membarrier
nr_reads 4210380395 nr_writes 2206 nr_ops 4210382601
883042 syscalls:sys_enter_membarrier
nr_reads 4233636203 nr_writes 3280 nr_ops 4233639483
867184 syscalls:sys_enter_membarrier
# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads 6539807379 nr_writes 5289 nr_ops 6539812668
10578 syscalls:sys_enter_membarrier
nr_reads 6500401303 nr_writes 13287 nr_ops 6500414590
26574 syscalls:sys_enter_membarrier
nr_reads 6518640060 nr_writes 8780 nr_ops 6518648840
17560 syscalls:sys_enter_membarrier
Analysis
========
Now, this is some interesting data. Initially, with unchanged Bionic
source code, we see 607 sys_membarrier syscalls in 10 seconds for
test_urcu, and 1 sys_membarrier syscall for test_urcu_bp. In reality,
the urcu-bp count is effectively 0, not 1, due to commit [1]
64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6, which removes the use of
sys_membarrier from urcu-bp because of the major performance problems
that blocking syscalls cause in LTTng.
[1] https://github.com/urcu/userspace-rcu/commit/64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6
(note this was backported to 0.10.1 stable release, and is in Bionic)
Looking at the patched versions, the sys_membarrier syscall count for
test_urcu skyrockets to 835874, a whopping 1377x increase. We went
from 60 syscalls/sec to 83587 syscalls/sec, which more or less
demonstrates that each call in the patched liburcu spent less time in
kernel space, as the syscalls did not block and exited quickly.
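As a quick check of the rates above, the arithmetic can be written out as
follows. The variable names are illustrative only. The counts are the
first test_urcu run of each perf session, and the 10 second runtime is
the third argument passed to the test program.

```python
RUNTIME_SECONDS = 10  # third argument in "./test_urcu 6 2 10"

unpatched_calls = 607    # first unpatched test_urcu run
patched_calls = 835874   # first patched test_urcu run

unpatched_rate = unpatched_calls / RUNTIME_SECONDS
patched_rate = patched_calls / RUNTIME_SECONDS
increase = patched_calls / unpatched_calls

print(f"unpatched: {unpatched_rate:.1f} syscalls/sec")
print(f"patched:   {patched_rate:.1f} syscalls/sec")
print(f"increase:  {increase:.0f}x")
```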
The patches re-enable the use of sys_membarrier in the urcu-bp variant,
and we see the syscall was called roughly 10,000 to 27,000 times over
10 seconds. This is behind the massive 10x increase in the number of
operations the test did, as it went from using userspace level memory
barriers to kernel space membarrier syscalls, which are much faster.
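One pattern in the patched test_urcu_bp figures is worth a quick check:
in all three perf runs, the tracepoint count is exactly twice nr_writes.
The sketch below only verifies that arithmetic. Reading it as two
membarrier calls per write-side update is an interpretation, not
something the perf output itself states.

```python
# (nr_writes, sys_enter_membarrier count) for the three patched
# test_urcu_bp runs, copied from the perf output above.
runs = [(5289, 10578), (13287, 26574), (8780, 17560)]

for writes, calls in runs:
    print(f"writes={writes} calls={calls} calls/write={calls / writes:.1f}")

# True when every run shows exactly two syscalls per write.
two_per_write = all(calls == 2 * writes for writes, calls in runs)
print("exactly two calls per write:", two_per_write)
```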
Conclusion
==========
This SRU changes liburcu to use the MEMBARRIER_CMD_PRIVATE_EXPEDITED
command of the sys_membarrier syscall, over the previous
MEMBARRIER_CMD_SHARED command.
MEMBARRIER_CMD_SHARED blocks as it must wait for all threads in the
system to agree on the view of memory, while with
MEMBARRIER_CMD_PRIVATE_EXPEDITED, only the threads in the local process
need to agree, and MEMBARRIER_CMD_PRIVATE_EXPEDITED is guaranteed to
never block.
With the non-blocking behaviour, we see sys_membarrier operate much more
quickly, and it can complete many more times per second than the
previous implementation which blocks.
For most workloads, not getting stuck on a blocking call to
sys_membarrier should improve application performance. While the
benchmark programs do indicate a 1/3 drop in operations for the
normal urcu variant, its performance is in line with what you would
expect from the current state of the art in Focal.
I believe this SRU is a net benefit to the performance of liburcu.
--
https://bugs.launchpad.net/bugs/1876230
Title:
liburcu: Enable MEMBARRIER_CMD_PRIVATE_EXPEDITED to address
performance problems with MEMBARRIER_CMD_SHARED
Status in liburcu package in Ubuntu:
Fix Released
Status in liburcu source package in Bionic:
In Progress
Bug description:
[Impact]
In Linux 4.3, a new syscall was defined, called "membarrier". This
system call was defined specifically for use in userspace-rcu (liburcu)
to speed up the fast path / reader side of the library. The original
implementation in Linux 4.3 only supported the MEMBARRIER_CMD_SHARED
subcommand of the membarrier syscall.
MEMBARRIER_CMD_SHARED executes a memory barrier on all threads from
all processes running on the system. When it exits, the userspace
thread which called it is guaranteed that all running threads share
the same world view with regard to userspace addresses which are
consumed by readers and writers.
The problem with MEMBARRIER_CMD_SHARED is that system calls made in this
fashion can block, since the barrier is deployed across all threads in
the system, and some of those threads may be waiting on blocking
operations and take time to reach the barrier.
In Linux 4.14, this was addressed by adding the
MEMBARRIER_CMD_PRIVATE_EXPEDITED command to the membarrier syscall. It
only targets threads which share the same mm as the thread calling the
membarrier syscall, aka, threads in the current process, and not all
threads / processes in the system.
Calls to membarrier with the MEMBARRIER_CMD_PRIVATE_EXPEDITED command
are guaranteed non-blocking, due to using inter-processor interrupts
to implement memory barriers.
Because of this, membarrier calls that use
MEMBARRIER_CMD_PRIVATE_EXPEDITED are much faster than those that use
MEMBARRIER_CMD_SHARED.
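To make the command sequence concrete, here is a minimal sketch that
issues the syscall directly from Python via ctypes. It is not how
liburcu does it (liburcu is C), and the helper names are hypothetical.
The command values come from <linux/membarrier.h>. The syscall numbers
are architecture specific and only x86_64 and aarch64 are listed, so
treat that table as an assumption about your platform. Note that a
process has to register with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED
before MEMBARRIER_CMD_PRIVATE_EXPEDITED will succeed.

```python
import ctypes
import platform

# membarrier(2) command values from <linux/membarrier.h>.
MEMBARRIER_CMD_QUERY = 0
MEMBARRIER_CMD_SHARED = 1 << 0
MEMBARRIER_CMD_PRIVATE_EXPEDITED = 1 << 3
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = 1 << 4

# Syscall numbers differ per architecture; only two are listed here.
SYS_MEMBARRIER = {"x86_64": 324, "aarch64": 283}


def membarrier(cmd, flags=0):
    """Call membarrier(2); return its result, or None if it is unavailable or fails."""
    nr = SYS_MEMBARRIER.get(platform.machine())
    if platform.system() != "Linux" or nr is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_long(cmd),
                       ctypes.c_long(flags), ctypes.c_long(0))
    return None if ret < 0 else ret


def private_expedited_barrier():
    """Query, register, then issue a private expedited barrier."""
    mask = membarrier(MEMBARRIER_CMD_QUERY)
    if mask is None:
        return "membarrier unavailable"
    if not mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED:
        return "private expedited unsupported"  # e.g. kernels before 4.14
    # Registration is required once per process before the command is used.
    if membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) is None:
        return "registration failed"
    if membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED) is None:
        return "barrier failed"
    return "private expedited barrier issued"


if __name__ == "__main__":
    print(private_expedited_barrier())
```

The query step matters on older kernels: it is what lets a caller fall
back to another mechanism when the expedited command is not offered.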
Since Bionic uses a 4.15 kernel, all kernel requirements are met, and
this SRU is to enable support for MEMBARRIER_CMD_PRIVATE_EXPEDITED in
the liburcu package.
This brings the performance of the liburcu library back in line with
where it was in Trusty, as a particular user has seen performance
problems after upgrading from Trusty to Bionic.
[Test]
Testing performance is heavily dependent on the application which
links against liburcu, and the workload which it executes.
A test package is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf276198-test
For the sake of testing, we can use the benchmarks provided in the
liburcu source code. Download a copy of the source code for liburcu
either from the repos or from github:
$ pull-lp-source liburcu bionic
# OR
$ git clone https://github.com/urcu/userspace-rcu.git
$ git checkout v0.10.1 # version in bionic
Build the code:
$ ./bootstrap
$ ./configure
$ make
Go into the tests/benchmark directory
$ cd tests/benchmark
From there, you can run benchmarks for the four main usages of
liburcu: urcu, urcu-bp, urcu-signal and urcu-mb.
On an 8-core machine, with 6 threads for readers and 2 threads for
writers, and a 10 second runtime, execute:
$ ./test_urcu 6 2 10
$ ./test_urcu_bp 6 2 10
$ ./test_urcu_signal 6 2 10
$ ./test_urcu_mb 6 2 10
Results:
$ ./test_urcu 6 2 10
0.10.1-1: 17612527667 reads, 268 writes, 17612527935 ops
0.10.1-1ubuntu1: 14988437247 reads, 810069 writes, 14989247316 ops
$ ./test_urcu_bp 6 2 10
0.10.1-1: 1177891079 reads, 1699523 writes, 1179590602 ops
0.10.1-1ubuntu1: 13230354737 reads, 575314 writes, 13230930051 ops
$ ./test_urcu_signal 6 2 10
0.10.1-1: 20128392417 reads, 6859 writes, 20128399276 ops
0.10.1-1ubuntu1: 20501430707 reads, 6890 writes, 20501437597 ops
$ ./test_urcu_mb 6 2 10
0.10.1-1: 627996563 reads, 5409563 writes, 633406126 ops
0.10.1-1ubuntu1: 653194752 reads, 4590020 writes, 657784772 ops
The SRU only changes behaviour for urcu and urcu-bp, since they are
the only "flavours" of liburcu which the patches change. From a pure
ops standpoint:
$ ./test_urcu 6 2 10
17612527935 ops
14989247316 ops
$ ./test_urcu_bp 6 2 10
1179590602 ops
13230930051 ops
We see that in this particular benchmark workload, test_urcu sees extra
performance overhead with MEMBARRIER_CMD_PRIVATE_EXPEDITED, which is
explained by the extra impact that it has on the slowpath, and the
far larger number of writes it did during my benchmark.
The real winner in this benchmark workload is test_urcu_bp, which sees
a more than 10x performance increase with
MEMBARRIER_CMD_PRIVATE_EXPEDITED. Some of this may be down to the 3x
fewer writes it did during my benchmark.
Again, these benchmarks are indicative only and are very "random".
Performance is really dependent on the application which links against
liburcu and its workload.
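The ops and writes ratios behind these statements can be worked out from
the Results figures. The sketch below is illustrative only: the
dictionary layout is made up, and the numbers are copied from the
0.10.1-1 and 0.10.1-1ubuntu1 lines above.

```python
# (reads, writes, ops) for 0.10.1-1 and then 0.10.1-1ubuntu1,
# copied from the Results section above.
results = {
    "test_urcu": ((17612527667, 268, 17612527935),
                  (14988437247, 810069, 14989247316)),
    "test_urcu_bp": ((1177891079, 1699523, 1179590602),
                     (13230354737, 575314, 13230930051)),
}

ratios = {}
for name, (before, after) in results.items():
    ops_ratio = after[2] / before[2]
    writes_ratio = after[1] / before[1]
    ratios[name] = (ops_ratio, writes_ratio)
    print(f"{name}: ops x{ops_ratio:.2f}, writes x{writes_ratio:.2f}")
```

For test_urcu_bp the writes ratio comes out at about one third, which is
the "3x fewer writes" mentioned above.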
[Regression Potential]
This SRU changes the behaviour of the following libraries which
applications link against: -lurcu and -lurcu-bp. Behaviour is not
changed in the rest: -lurcu-qsbr, -lurcu-signal and -lurcu-mb.
On Bionic, liburcu will call the membarrier syscall in urcu and
urcu-bp. This does not change. What is changing is the semantics of that
syscall, from MEMBARRIER_CMD_SHARED to
MEMBARRIER_CMD_PRIVATE_EXPEDITED. The changed code is all run in
kernel space and resides in the kernel. These commits simply change
the parameters which are supplied to the membarrier syscall from
liburcu.
I have run the testsuite that comes with the Bionic source code, and
"make regtest", "make short_bench" and "make long_bench" pass. You
want to run these on a cloud instance somewhere since they take
multiple hours.
If a regression were to occur, applications linked against -lurcu and
-lurcu-bp would be affected. The homepage, https://liburcu.org/, offers
a list of the major applications that use liburcu: Knot DNS,
Netsniff-ng, Sheepdog, GlusterFS, gdnsd and LTTng.
[Scope]
The two commits which are being SRU'd are:
commit c0bb9f693f926595a7cb8b4ce712cef08d9f5d49
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date: Thu Dec 21 13:42:23 2017 -0500
Subject: liburcu: Use membarrier private expedited when available
Link: https://github.com/urcu/userspace-rcu/commit/c0bb9f693f926595a7cb8b4ce712cef08d9f5d49
commit 3745305bf09e7825e75ee5b5490347ee67c6efdd
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date: Fri Dec 22 10:57:59 2017 -0500
Subject: liburcu-bp: Use membarrier private expedited when available
Link: https://github.com/urcu/userspace-rcu/commit/3745305bf09e7825e75ee5b5490347ee67c6efdd
Both cherry pick directly onto 0.10.1 in Bionic, and are originally
from 0.11.0, meaning that Eoan, Focal and Groovy already have the
patch.
[Other]
If you are interested in how the membarrier syscall works, you can
read the commits that introduced it in the Linux kernel:
commit 5b25b13ab08f616efd566347d809b4ece54570d1
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date: Fri Sep 11 13:07:39 2015 -0700
Subject: sys_membarrier(): system-wide memory barrier (generic, x86)
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b25b13ab08f616efd566347d809b4ece54570d1
commit 22e4ebb975822833b083533035233d128b30e98f
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date: Fri Jul 28 16:40:40 2017 -0400
Subject: membarrier: Provide expedited private command
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22e4ebb975822833b083533035233d128b30e98f
Additionally, blog posts from LTTng:
https://lttng.org/blog/2018/01/15/membarrier-system-call-performance-and-userspace-rcu/
And Phoronix:
https://www.phoronix.com/scan.php?page=news_item&px=URCU-Membarrier-Performance