Analysing degraded performance at high concurrency in sysbench OLTP

 

At the Barcelona meeting in January, I promised to take a look at the
high-concurrency sysbench OLTP benchmarks, and now I finally had the time to
do this.

There was a lot of work on LOCK_open by Svoj and Serg. If I have understood
correctly, the basic problem was that at high concurrency (say, 512 threads),
the TPS is only a small fraction of the peak throughput at lower concurrency.
Basically, the server "falls over" and starts thrashing instead of doing real
work, due to some kind of inter-processor communication overhead.

I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr
revno:4151 (revid:svoj@xxxxxxxxxxx-20140415072957-yeir4jvokyilw5hp). I
compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.

(I just realised that my runs are with 32 tables, while I think the benchmarks
in January focused on single-table runs. Maybe I need to re-do my analysis
with the single-table benchmark, or perhaps it is too artificial to matter
much?).

In the read-only sysbench, the server mostly does not fall over. I guess this
is due to the work by Svoj on eliminating LOCK_open?

But in read-write, performance drops dramatically at high concurrency. TPS
drops to 2600 at 512 threads, compared to a peak of around 13000 (the numbers
here are approximate only; they vary somewhat between runs).

So I analysed the r/w benchmark with the Linux `perf` tool. It turns out that
two-thirds of the time is spent in a single kernel function, _raw_spin_lock():

  -  66.26%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock

Digging further using --call-graph, this turns out to be mostly futex waits
(and futex wakeups) from inside InnoDB locking primitives. Calls like
sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in
particular.

So this is related to the non-scalable implementation of locking primitives
in InnoDB, which is a known problem. I think Mark Callaghan has written about
it a couple of times. Last time I looked at the code, every single mutex wait
had to take a global mutex protecting some global arrays and related state. I
even remember seeing code that at mutex release would pthread_cond_broadcast()
_every_ waiter, all of them waking up, only for all of them (except one) to go
do another wait. This is a killer for scalability.
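
To make the pattern concrete, here is a minimal sketch in plain pthreads (not
InnoDB's actual code; the names are made up for illustration): every wait and
every wakeup serialises on a single guard mutex, and release wakes all waiters
even though only one of them can take the lock:

  #include <pthread.h>
  #include <stdbool.h>

  struct event_mutex {
    pthread_mutex_t guard;   /* single mutex guarding the wait state */
    pthread_cond_t  cond;    /* all waiters block on this */
    bool            locked;  /* the actual lock word */
  };

  static void event_mutex_lock(struct event_mutex *m)
  {
    pthread_mutex_lock(&m->guard);
    while (m->locked)                         /* every waiter queues up here... */
      pthread_cond_wait(&m->cond, &m->guard);
    m->locked = true;
    pthread_mutex_unlock(&m->guard);
  }

  static void event_mutex_unlock(struct event_mutex *m)
  {
    pthread_mutex_lock(&m->guard);
    m->locked = false;
    pthread_cond_broadcast(&m->cond);         /* ...and release wakes all of them,
                                                 though only one can win the lock */
    pthread_mutex_unlock(&m->guard);
  }

With hundreds of threads, every release triggers hundreds of wakeups that
immediately serialise on the guard mutex again, which is exactly the kind of
futex traffic that shows up in the profile above.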

While investigating, I discovered the variable innodb_sync_array_size, which I
did not know about. It seems to split the mutex for some of the
synchronisation operations. So I tried to re-run the benchmark with
innodb_sync_array_size set to 8 and to 64. In both cases I got a significant
improvement: TPS increased to around 5900, more than twice the value with
innodb_sync_array_size at its default of 1.
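
As far as I understand it - and this is an assumption about the mechanism, I
have not checked the code - the setting splits the single guarded wait array
into several independent ones, roughly along these lines (names invented):

  #include <pthread.h>
  #include <stdint.h>

  #define N_SYNC_ARRAYS 8          /* corresponds to innodb_sync_array_size */

  struct sync_array {
    pthread_mutex_t guard;         /* each array has its own guard mutex */
    /* ... wait cells ... */
  };

  static struct sync_array sync_arrays[N_SYNC_ARRAYS];

  /* Pick an array for a waiting thread. The selection scheme here is a
     guess (it could just as well be a round-robin counter); the point is
     only that unrelated waits no longer contend on one single guard. */
  static struct sync_array *sync_array_for(const void *wait_object)
  {
    return &sync_arrays[((uintptr_t)wait_object / 64) % N_SYNC_ARRAYS];
  }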

So it is clear that the main limitation in this benchmark was the non-scalable
InnoDB synchronisation implementation. After tuning innodb_sync_array_size,
time spent in _raw_spin_lock() is down to half what it was before (33% of
total time):

  +  33.77%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock

The call graphs now show that the sync_array operations are much less
visible. Instead, mutex_create_func(), called from dict_mem_table_create(),
is what turns up prominently in the profile. I am not familiar with what this
part of the InnoDB code is doing, but from a quick look it creates a mutex -
and there is another global mutex needed for this, which again limits
scalability.
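
Again only as a rough sketch (I have not read that code carefully, and these
names are invented), the pattern appears to be that every mutex creation must
register the new mutex in a global list under one global guard:

  #include <pthread.h>

  struct mutex_node {
    pthread_mutex_t    mutex;
    struct mutex_node *next;
  };

  /* One global list of all created mutexes, itself protected by a single
     global mutex. */
  static pthread_mutex_t    registry_guard = PTHREAD_MUTEX_INITIALIZER;
  static struct mutex_node *registry_head;

  static void mutex_create_and_register(struct mutex_node *n)
  {
    pthread_mutex_init(&n->mutex, NULL);
    pthread_mutex_lock(&registry_guard);    /* every creation, in every thread, */
    n->next = registry_head;                /* serialises on this one lock      */
    registry_head = n;
    pthread_mutex_unlock(&registry_guard);
  }

If dict_mem_table_create() goes through something like this for every table
object, then all threads creating table objects will serialise on that one
lock.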

It is a bit surprising to see mutex creation being the most significant
bottleneck in the benchmark. I would have assumed that most mutexes could be
created up-front and re-used? It is possible that this is a warm-up effect -
maybe the code is filling up the buffer pool or some table-cache-like
structure inside InnoDB? I do see TPS being rather low for the first 150
seconds of the run (around 3000), and then increasing suddenly to around
8000-9000 for the rest. This might be worth investigating further.

So in summary, my investigations found that the bottleneck in this benchmark,
and the likely cause of the fall-over, is a scalability problem with InnoDB
locking primitives. The sync_array part seems to be mitigated to some degree
by innodb_sync_array_size, the mutex creation part still needs to be
investigated.

I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does
anyone know? I vaguely recall reading something about it, but I am not sure.
It would seem a waste to duplicate their efforts.

In any case, I hope this was useful. As part of this investigation, I
installed a new 3.14 kernel on the lizard2 machine and a new `perf`
installation, which seems to work well for more detailed investigations of
these kinds of issues. So let me know if there are other benchmarks that I
should look into. One thing that could be interesting is to look for false
sharing; there are some performance counters that the Intel manuals describe
as usable for this.

As an aside: in my tests, once concurrency becomes high enough that the
server falls over, the actual TPS number becomes mostly meaningless. E.g. I
saw that putting dummy pause loops into the code increased TPS. If TPS
stabilises at N% of peak throughput as concurrency goes to infinity, then we
can compare N (here, 2600 out of a peak of around 13000 would be roughly
20%). But if N goes to zero as concurrency goes to infinity, I think it is
meaningless to compare actual TPS numbers - we should instead focus on
removing the fall-over behaviour.

(Maybe this is already obvious to you; I have not followed the previous
benchmark efforts that closely.)

Hope this helps,

 - Kristian.

