maria-developers team mailing list archive

Next steps in improving single-threaded performance

 

I have been analysing CPU bottlenecks in single-threaded sysbench read-only
load. I found that icache misses are the main bottleneck, and that
profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25%
or more.

(More details in my blog posts:

    http://kristiannielsen.livejournal.com/17676.html
    http://kristiannielsen.livejournal.com/18168.html
)

Now I would like to ask for some discussion and help on how to get this
implemented in practice. It involves changing the build process for our
binaries: first compile with gcc --coverage, then run some profile workload,
then recompile with -fprofile-use.

I implemented a simple program to generate some profile load:

    https://github.com/knielsen/gen_profile_load

It runs a bunch of simple insert/select/update/delete, with different
combinations of storage engine, binlog format, and client API. It is designed
to run inside the build tree and handle starting and stopping the server being
tested, so it is pretty close to a working setup. These commands work to
generate a binary that is faster due to PGO:

  mkdir bld
  cd bld
  cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" ..
  make

  tests/gen_profile_load

  cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" ..
  make

So all the pieces really are there, and it should be possible to implement it.
But we need to find a good way to integrate it into our build system.
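
One detail any integration probably needs is a check between the two builds.
The --coverage binaries write their counters to .gcda files when they exit, and
as far as I know GCC does not fail the -fprofile-use build if those files are
missing; it just compiles without profile feedback. A minimal sketch of such a
guard follows. The function names count_gcda and require_profile_data are made
up for illustration and are not part of gen_profile_load.

```shell
#!/bin/sh
# Hypothetical helper: count the .gcda profile files under a build tree.
# $1: build directory to search.
count_gcda() {
    find "$1" -type f -name '*.gcda' | wc -l | tr -d ' '
}

# Hypothetical guard: refuse to continue to the -fprofile-use build
# when the profile run left no data behind.
require_profile_data() {
    n=$(count_gcda "$1")
    if [ "$n" -eq 0 ]; then
        echo "no .gcda files under $1; did the profile run exit cleanly?" >&2
        return 1
    fi
    echo "found $n .gcda files under $1"
}
```

It could be called as "require_profile_data ." from the build directory, right
after tests/gen_profile_load and before the second cmake run.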

The best would be to integrate it into our cmake files.

The gen_profile_load.c could go into tests/. Ideally we would build both a
statically and a dynamically linked version (so we get PGO for both
libmysqlclient.a and libmysqlclient.so). Can anyone help me get cmake to do
that?

And it would be cool if we could get the above procedure to work completely
within cmake, so that the user could just do:

    cmake -DWITH_PGO ... ; make

and cmake would itself handle first building with --coverage, then running
gen_profile_load.static and gen_profile_load.dynamic, then rebuilding with
-fprofile-use. Does anyone know if this is possible with cmake, and if so,
could you help implement it?

But alternatively, we could integrate a double build, like the commands above,
into the buildbot scripts (.deb, .rpm, bintar).
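
For the buildbot variant, the double build could be wrapped in a small shell
helper along these lines. This is only a sketch that repeats the manual
commands above. The names pgo_flags, pgo_configure_and_make and pgo_build, and
the DRY_RUN variable, are hypothetical and do not exist in our scripts today.

```shell
#!/bin/sh
# Sketch of a two-pass PGO build, mirroring the manual commands.
COMMON_FLAGS="-Wno-maybe-uninitialized -g -O3"

# Print the compiler flags for a phase: "generate" or "use".
pgo_flags() {
    case "$1" in
        generate) echo "$COMMON_FLAGS --coverage" ;;
        use)      echo "$COMMON_FLAGS -fprofile-use -fprofile-correction" ;;
        *)        echo "unknown phase: $1" >&2; return 1 ;;
    esac
}

# Configure and build one phase in the current (build) directory.
# $1: phase, $2: source directory. With DRY_RUN set, only print the commands.
pgo_configure_and_make() {
    flags=$(pgo_flags "$1") || return 1
    ${DRY_RUN:+echo} cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        "-DCMAKE_C_FLAGS_RELWITHDEBINFO=$flags" \
        "-DCMAKE_CXX_FLAGS_RELWITHDEBINFO=$flags" "$2" || return 1
    ${DRY_RUN:+echo} make
}

# Full cycle: instrumented build, profile run, optimised rebuild.
pgo_build() {
    pgo_configure_and_make generate "$1" || return 1
    ${DRY_RUN:+echo} tests/gen_profile_load || return 1
    pgo_configure_and_make use "$1"
}
```

Usage would be "mkdir bld; cd bld; pgo_build .." from the source tree. Setting
DRY_RUN=1 first prints the five commands instead of running them, which is
handy for checking the flags before a long build.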

Any comments? Here are some more points:

 - I tested that gen_profile_load gives a good speedup of sysbench read-only
   (around 30%, so still very significant even though it generates a different
   and more varied load).

 - As another test, I removed all SELECT from gen_profile_load, and ran the
   resulting PGO binary with sysbench read-only. This still gave a fair
   speedup, despite the PGO load being completely different from the benchmark
   load. This gives me confidence that the PGO should not cause performance
   regressions in cases not covered well by gen_profile_load.

 - More tests would be nice, of course. Axel, would you be able to build some
   binaries following the above procedure, and test some different random
   benchmarks? Anything that is easy to run could be interesting, both to test
   for improvement, and to check against regressions.

 - We probably need a recent GCC version to get good results. I used GCC
   version 4.7.2. Maybe we should install this GCC version in all the VMs we
   use to build binaries?

 - Should we do this in 5.5? I think we might want to. The speedup is quite
   significant, and it seems very safe - no code modifications are involved,
   only different compiler options.
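
On the GCC version point, the build scripts could check the compiler before
starting a PGO build. A minimal sketch, assuming 4.7.2 is treated as the
minimum (it is only the version used in the tests above, not a verified lower
bound). The function names version_at_least and check_gcc_for_pgo are
hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2.
# Compares up to three dot-separated numeric fields; missing fields count as 0.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# Check the gcc found in PATH against the assumed minimum of 4.7.2.
check_gcc_for_pgo() {
    have=$(gcc -dumpversion) || return 1
    if version_at_least "$have" 4.7.2; then
        echo "gcc $have: ok for PGO build"
    else
        echo "gcc $have is older than 4.7.2" >&2
        return 1
    fi
}
```

A build VM script could run check_gcc_for_pgo first and fall back to a normal
single build when it fails.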

Any thoughts? Volunteers for helping with the cmake or buildbot parts?

 - Kristian.

