maria-developers team mailing list archive

Re: Next steps in improving single-threaded performance


Hi Kristian,

just out of curiosity: is it possible to find out which functions cause the
highest number of icache misses? Could it have anything to do with branch
misprediction?


On Fri, Jan 24, 2014 at 03:51:25PM +0100, Kristian Nielsen wrote:
> I have been analysing CPU bottlenecks in single-threaded sysbench read-only
> load. I found that icache misses are the main bottleneck, and that
> profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25%
> or more.
> (More details in my blog posts:
>     http://kristiannielsen.livejournal.com/17676.html
>     http://kristiannielsen.livejournal.com/18168.html
> )
> Now I would like to ask for some discussions/help in how to get this
> implemented in practice. It involves changing the build process for our
> binaries: First compile with gcc --coverage, then run some profile workload,
> then recompile with -fprofile-use.
> I implemented a simple program to generate some profile load:
>     https://github.com/knielsen/gen_profile_load
> It runs a bunch of simple insert/select/update/delete, with different
> combinations of storage engine, binlog format, and client API. It is designed
> to run inside the build tree and handle starting and stopping the server being
> tested, so it is pretty close to a working setup. These commands work to
> generate a binary that is faster due to PGO:
>   mkdir bld
>   cd bld
>   cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" ..
>   make
>   tests/gen_profile_load
>   cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" ..
>   make
> So all the pieces really are there, it should be possible to implement it. But
> we need to find a good way to integrate it into our build system.
> The best would be to integrate it into our cmake files.
> The gen_profile_load.c could go into tests/, ideally we would build both a
> static and dynamically linked version (so we get PGO for both libmysqlclient.a
> and libmysqlclient.so). Can anyone help me get cmake to do that?
> And it would be cool if we could get the above procedure to work completely
> within cmake, so that the user could just do:
>     cmake -DWITH_PGO ... ; make
> and cmake would itself handle first building with --coverage, then running
> gen_profile_load.static and gen_profile_load.dynamic, then rebuilding with
> -fprofile-use. Anyone know if this is possible with cmake, and if so could
> help implement it?
> But alternatively, we could integrate a double build, like the commands above,
> into the buildbot scripts (.deb, .rpm, bintar).
> Any comments? Here are some more points:
>  - I tested that gen_profile_load gives a good speedup of sysbench read-only
>    (around 30%, so still very significant even though it generates a different
>    and more varied load).
>  - As another test, I removed all SELECT from gen_profile_load, and ran the
>    resulting PGO binary with sysbench read-only. This still gave a fair
>    speedup, despite the PGO load being completely different from the benchmark
>    load. This gives me confidence that the PGO should not cause performance
>    regressions in cases not covered well by gen_profile_load.
>  - More tests would be nice, of course. Axel, would you be able to build some
>    binaries following the above procedure, and test some different random
>    benchmarks? Anything that is easy to run could be interesting, both to test
>    for improvement, and to check against regressions.
>  - We probably need a recent GCC version to get good results. I used GCC
>    version 4.7.2. Maybe we should install this GCC version in all the VMs we
>    use to build binaries?
>  - Should we do this in 5.5? I think we might want to. The speedup is quite
>    significant, and it seems very safe - no code modifications are involved,
>    only different compiler options.
> Any thoughts? Volunteers for helping with the cmake or buildbot parts?
>  - Kristian.
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp
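
For the buildbot alternative mentioned in the quoted mail, the double build could be wrapped in a small script. The following is only a sketch under assumptions: the function names (pgo_flags, pgo_pass, pgo_build) and the PGO_RUN switch are made up for illustration, while the compiler flags, cmake options, and the tests/gen_profile_load path are copied from the commands in the mail. It assumes it is started from the top of the source tree.

```shell
#!/bin/sh
# Sketch of the two-pass PGO build from the quoted mail.
# Function names and the PGO_RUN switch are hypothetical; the flags and
# cmake options are taken verbatim from the mail.

BASE_FLAGS="-Wno-maybe-uninitialized -g -O3"

# Print the compiler flags for one pass: "generate" or "use".
pgo_flags() {
    case "$1" in
        generate) echo "$BASE_FLAGS --coverage" ;;
        use)      echo "$BASE_FLAGS -fprofile-use -fprofile-correction" ;;
        *)        echo "unknown pass: $1" >&2; return 1 ;;
    esac
}

# Configure and build one pass; expects to be run inside the build directory.
pgo_pass() {
    flags=$(pgo_flags "$1") || return 1
    cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 \
          -DCMAKE_BUILD_TYPE=RelWithDebInfo \
          -DCMAKE_C_FLAGS_RELWITHDEBINFO="$flags" \
          -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="$flags" \
          .. && make
}

# Full sequence: instrumented build, profile workload, optimised rebuild.
pgo_build() {
    mkdir -p bld && cd bld || return 1
    pgo_pass generate &&
    tests/gen_profile_load &&
    pgo_pass use
}

# Run the build only when explicitly requested, so that sourcing this
# file merely defines the functions.
if [ "${PGO_RUN:-0}" = 1 ]; then
    pgo_build
fi
```

Sourcing the file only defines the functions; starting it with PGO_RUN=1 set would run the whole sequence. Keeping the flag selection in one function means a packaging script (.deb, .rpm, bintar) would only need to call pgo_build instead of repeating the long cmake lines.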
