maria-developers team mailing list archive
Message #06693
Next steps in improving single-threaded performance
I have been analysing CPU bottlenecks in single-threaded sysbench read-only
load. I found that icache misses are the main bottleneck, and that
profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25%
or more.
(More details in my blog posts:
http://kristiannielsen.livejournal.com/17676.html
http://kristiannielsen.livejournal.com/18168.html
)
Now I would like to ask for some discussion and help on how to get this
implemented in practice. It involves changing the build process for our
binaries: first compile with gcc --coverage, then run some profile workload,
then recompile with -fprofile-use.
I implemented a simple program to generate some profile load:
https://github.com/knielsen/gen_profile_load
It runs a bunch of simple insert/select/update/delete, with different
combinations of storage engine, binlog format, and client API. It is designed
to run inside the build tree and handle starting and stopping the server being
tested, so it is pretty close to a working setup. These commands work to
generate a binary that is faster due to PGO:
mkdir bld
cd bld
cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" ..
make
tests/gen_profile_load
cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" ..
make
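A sanity check that may be worth running between the two builds: a binary
compiled with --coverage writes its counters to *.gcda files when the process
exits, so if the profile workload left none behind (for example because the
server was killed rather than shut down cleanly), the -fprofile-use build has
nothing to work with. The helper names below are made up for this sketch and
are not part of gen_profile_load:

```shell
# Hypothetical helpers (names invented for this sketch).

# Print the number of *.gcda profile data files below directory $1 (default: .).
count_profile_files() {
    find "${1:-.}" -type f -name '*.gcda' | wc -l | tr -d ' '
}

# Succeed only if at least one profile data file exists below $1.
have_profile_data() {
    [ "$(count_profile_files "${1:-.}")" -gt 0 ]
}
```

It could be used between the profile run and the second cmake invocation as:
have_profile_data . || { echo "no profile data found" >&2; exit 1; }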
So all the pieces really are there, and it should be possible to implement it.
But we need to find a good way to integrate it into our build system.
The best would be to integrate it into our cmake files.
The gen_profile_load.c could go into tests/. Ideally we would build both a
static and a dynamically linked version (so we get PGO for both
libmysqlclient.a and libmysqlclient.so). Can anyone help me get cmake to do
that?
And it would be cool if we could get the above procedure to work completely
within cmake, so that the user could just do:
cmake -DWITH_PGO=1 ... ; make
and cmake would itself handle first building with --coverage, then running
gen_profile_load.static and gen_profile_load.dynamic, then rebuilding with
-fprofile-use. Does anyone know if this is possible with cmake, and if so,
could you help implement it?
But alternatively, we could integrate a double build, like the commands above,
into the buildbot scripts (.deb, .rpm, bintar).
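As a rough idea of what the double build could look like in a packaging
script, here is a minimal sketch that wraps the commands from above in shell
functions. It is meant to be run from inside the build directory. SRC_DIR and
DRY_RUN, and the function names, are invented for this example; the real
buildbot scripts would need their own packaging steps around it.

```shell
# Sketch of a two-pass PGO build driver; run from inside the build directory.
# SRC_DIR and DRY_RUN are hypothetical knobs invented for this example.
SRC_DIR=${SRC_DIR:-..}
COMMON_FLAGS="-Wno-maybe-uninitialized -g -O3"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Configure and build with COMMON_FLAGS plus the extra flags given in $1.
configure_and_build() {
    flags="$COMMON_FLAGS $1"
    run cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        "-DCMAKE_C_FLAGS_RELWITHDEBINFO=$flags" \
        "-DCMAKE_CXX_FLAGS_RELWITHDEBINFO=$flags" \
        "$SRC_DIR" &&
    run make
}

# Pass 1: instrumented build, then the profile workload, then pass 2:
# rebuild using the collected profile. Stops at the first failing step.
pgo_build() {
    configure_and_build "--coverage" &&
    run tests/gen_profile_load &&
    configure_and_build "-fprofile-use -fprofile-correction"
}
```

Calling "DRY_RUN=1 pgo_build" only prints the commands, which is handy for
checking the flags before committing to two full builds; plain "pgo_build"
runs them.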
Any comments? Here are some more points:
- I tested that gen_profile_load gives a good speedup of sysbench read-only
(around 30%, so still very significant even though it generates a different
and more varied load).
- As another test, I removed all SELECT from gen_profile_load, and ran the
resulting PGO binary with sysbench read-only. This still gave a fair
speedup, despite the PGO load being completely different from the benchmark
load. This gives me confidence that the PGO should not cause performance
regressions in cases not covered well by gen_profile_load.
- More tests would be nice, of course. Axel, would you be able to build some
binaries following the above procedure, and test some different random
benchmarks? Anything that is easy to run could be interesting, both to test
for improvement, and to check against regressions.
- We probably need a recent GCC version to get good results. I used GCC
version 4.7.2. Maybe we should install this GCC version in all the VMs we
use to build binaries?
- Should we do this in 5.5? I think we might want to. The speedup is quite
significant, and it seems very safe - no code modifications are involved,
only different compiler options.
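On the GCC version point: if the build VMs end up with mixed compilers, the
build scripts could refuse to attempt a PGO build with an old one. A minimal
sketch follows; the 4.7 threshold is only an assumption based on the version
I tested with, not a known minimum, and the function name is made up.

```shell
# Succeed if dotted version $1 is >= dotted version $2.
# Only major and minor are compared; a missing minor counts as 0.
version_at_least() {
    have=$1
    want=$2
    have_major=${have%%.*}
    want_major=${want%%.*}
    case $have in
        *.*) have_rest=${have#*.}; have_minor=${have_rest%%.*} ;;
        *)   have_minor=0 ;;
    esac
    case $want in
        *.*) want_rest=${want#*.}; want_minor=${want_rest%%.*} ;;
        *)   want_minor=0 ;;
    esac
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] &&
          [ "$have_minor" -ge "$want_minor" ]; }
}
```

Usage, with gcc -dumpversion supplying the installed version:
version_at_least "$(gcc -dumpversion)" 4.7 || echo "GCC too old for PGO" >&2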
Any thoughts? Volunteers for helping with the cmake or buildbot parts?
- Kristian.