
maria-developers team mailing list archive

Re: Next steps in improving single-threaded performance

 

Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:

> Axel Schwenke <axel@xxxxxxxxxxxx> writes:
>
>> Benchmark 1 is good old sysbench OLTP. I tested 10.0.7 vs. 10.0.7-pgo. With
>> low concurrency there is about 10% win by PGO; however this is completely
>> reversed at higher concurrency by mutex contention (the test was with
>> performance schema disabled, so cannot say which mutex, probably LOCK_open).
>
> Ouch, pgo drops the throughput to 1/2!
>
> That's a pretty serious blow to the whole idea, unless there is not just a fix
> but also a good explanation. I will investigate this, thanks a lot for
> testing!

Ok, so I finally got the time to investigate this. I think I understand what
is going on.

So the original problem was that PGO (profile-guided optimisation) showed a
fair improvement at lower concurrency, but a significant reduction in
throughput at higher concurrency, in sysbench OLTP.

It turns out that the real problem is unrelated to PGO. At higher concurrency,
the server code basically falls over, so that adding more concurrent work
significantly decreases the throughput. This is a well-known phenomenon.

As a side effect, if we improve the code performance of a single thread, we
effectively increase the concurrency in the critical spots - threads spend
less time executing the real code, hence more time in concurrency
bottlenecks. The end result is that _any_ change that improves single-threaded
performance causes throughput to decrease at concurrency levels where the code
falls over.

To verify this, I repeated XL's sysbench runs on a number of different mysqld
servers. Apart from XL's original 10.0 and 10.0-pgo, I added a run with _no_
optimisations (-O0), and some runs where I used PGO but deliberately decreased
performance by putting a dummy loop into the query execution code. Here are
the results for sysbench read-write:

Transactions per second in sysbench OLTP read-write (higher is better):

                          16-rw    128-rw   256-rw   512-rw
    10.0-nopgo          6680.84  13004.87  7850.10  4031.06
    10.0-pgo            7249.39  12199.32  6336.47  2614.58
    10.0-pgo-pause1000  7040.25  12081.80  5825.99  2464.58
    10.0-pgo-pause2000  6774.10  12024.44  5810.60  2433.14
    10.0-pgo-pause4000  6469.06  12859.23  6479.85  2589.90
    10.0-pgo-pause8000  5779.67  13233.35  7074.85  2741.01
    10.0-pgo-pause16000 4710.97  12286.62  7896.23  2889.25
    10.0-noopt          4004.37   9613.89  7920.67  3268.46

As we see, there is a strong correlation between higher throughput at low
concurrency and lower throughput at high concurrency. As we add more dummy
overhead to the PGO server, throughput at high concurrency increases, and at
256 threads the completely unoptimised -O0 build is even faster than any of
the optimised ones.

The sysbench read-only results are similar, though less pronounced, as the
code does not fall over as badly (I used a recent 10.0 bzr tree; maybe Svoj's
work on LOCK_open has helped solve the problem, or maybe my compiling out the
performance schema made a difference):

Transactions per second in sysbench OLTP read-only (higher is better):

                          16-ro    128-ro    256-ro    512-ro
    10.0-nopgo          8903.62  19034.44  18369.42  15933.65
    10.0-pgo            9602.81  20057.09  19084.66  13128.61
    10.0-pgo-pause1000  9169.94  20403.00  18814.08  12708.24
    10.0-pgo-pause2000  8870.11  20307.68  18618.01  13015.76
    10.0-pgo-pause4000  8331.52  19903.76  18425.81  13459.38
    10.0-pgo-pause8000  7610.22  18897.86  17650.32  13544.74
    10.0-pgo-pause16000 6079.60  16654.55  15853.86  14008.75
    10.0-noopt          4969.67  12830.43  12263.99  11438.69

Again, at the concurrency levels where pgo is slower than non-pgo, we can
improve throughput by inserting dummy pause loop code.

So the conclusion here is that PGO is actually a viable optimisation. We
should use it for the binaries we release, and if possible integrate it into
the cmake build so that from-source builds also benefit. The high-concurrency
sysbench results are meaningless as a measure of single-threaded improvements,
since any such improvement ends up degrading TPS there; the real problem needs
to be fixed elsewhere, by removing lock contention and so on.

I will try next to investigate why the code falls over at high concurrency and
see if anything can be done...

It also appears that sysbench results at high concurrency are mostly
meaningless for comparison between different code versions, unless we can see
that one version falls over and the other does not. Hopefully we can find a
way to eliminate the catastrophic performance hit at high concurrency...

 - Kristian.

