maria-developers team mailing list archive

Re: Analysing degraded performance at high concurrency in sysbench OLTP

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Sergey Vojtovich <svoj@xxxxxxxxxxx>
Date: Tue, 29 Apr 2014 15:21:01 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, Axel Schwenke <axel@xxxxxxxxxxxxxxxx>
In-reply-to: <87mwf44dq1.fsf@frigg.knielsen-hq.org>
User-agent: Mutt/1.5.21 (2010-09-15)
Hi Kristian,

On Tue, Apr 29, 2014 at 12:44:22PM +0200, Kristian Nielsen wrote:
> At the Barcelona meeting in January, I promised to take a look at the
> high-concurrency sysbench OLTP benchmarks, and now I finally had the time to
> do this.
Thanks for looking at it!

> 
> There was a lot of work on LOCK_open by Svoj and Serg. If I have understood
> correctly, the basic problem was that at high concurrency (like, 512 threads),
> the TPS is only a small fraction of the peak throughput at lower concurrency.
> Basically, the server "falls over" and starts thrashing instead of doing real
> work, due to some kind of inter-processor communication overhead.
There are quite a few issues around scalability. The one I was attempting to
solve is that MariaDB generates intensive bus traffic when its threads run on
different NUMA nodes. I suppose even 2 threads running on different nodes will
be affected.

This happens because of writes to shared memory locations. Mutexes that
spin seem to generate especially heavy bus traffic.

The subsystems that affect scalability most are:
1. THR_LOCK - per-share
2. table cache - now mostly per-share
3. InnoDB

> 
> I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr
> revno:4151 (revid:svoj@xxxxxxxxxxx-20140415072957-yeir4jvokyilw5hp). I
> compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.
> 
> (I just realised that my runs are with 32 tables, while I think the benchmarks
> in January focused on single-table runs. Maybe I need to re-do my analysis
> with the single-table benchmark, or perhaps it is too artificial to matter
> much?).
Yes, the benchmark was focused on single-table runs. Starting with 10.0.10 we
eliminated LOCK_open in favor of a per-share mutex. This means that scalability
issues should remain in single-table runs, but should be solved in multi-table
runs.

> 
> In the read-only sysbench, the server mostly does not fall over. I guess this
> is due to the work by Svoj on eliminating LOCK_open?
Likely. I would gladly interpret benchmark results if there are any. :)

Since I haven't analyzed InnoDB internals wrt scalability yet, I'd better stay
away from commenting on the rest of the e-mail.

Thanks,
Sergey

> 
> But in read-write, performance drops dramatically at high concurrency. TPS
> drops to 2600 at 512 threads compared to a peak of around 13000 (numbers here
> are approximate only, they vary somewhat between different runs).
> 
> So I analysed the r/w benchmark with the linux `perf` tool. It turns out
> two-thirds of the time is spent in a single kernel function _raw_spin_lock():
> 
>   -  66.26%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock
> 
> Digging further using --call-graph, this turns out to be mostly futex waits
> (and futex wakeups) from inside InnoDB locking primitives. Calls like
> sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in
> particular.
> 
> So this is related to the non-scalable implementation in InnoDB of locking
> primitives, which is a known problem. I think Mark Callaghan has written about
> it a couple of times. Last time I looked at the code, every single mutex wait
> has to take a global mutex protecting some global arrays and stuff. I even
> remember seeing code that at mutex release would pthread_cond_broadcast()
> _every_ waiter, all of them waking up, only for all but one to go back into
> another wait. This is a killer for scalability.
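
A toy model, not taken from the InnoDB source, puts a rough number on the
broadcast problem described above. It assumes a fixed set of waiters blocked on
one mutex and counts how many thread wakeups it takes to hand the mutex to each
of them once. The function name and the model itself are illustrative
assumptions only.

```python
def total_wakeups(waiters: int, broadcast: bool) -> int:
    """Count thread wakeups needed to hand one mutex to each waiter once.

    Toy model (an assumption, not InnoDB's code): `waiters` threads are
    blocked on a single mutex. Each release wakes either every remaining
    waiter (broadcast) or exactly one of them (signal).
    """
    wakeups = 0
    remaining = waiters
    while remaining > 0:
        # With broadcast, all remaining waiters wake up but only one wins
        # the mutex; the rest go back to sleep and are woken again later.
        wakeups += remaining if broadcast else 1
        remaining -= 1
    return wakeups


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, total_wakeups(n, broadcast=False), total_wakeups(n, broadcast=True))
```

In this model the signal-one variant costs one wakeup per waiter, while the
broadcast variant grows quadratically with the number of waiters, which is
consistent with the fall-over only showing up at high thread counts.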
> 
> While investigating, I discovered the variable innodb_sync_array_size, which I
> did not know about. It seems to split the mutex for some of the
> synchronisation operations. So I tried to re-run the benchmark with
> innodb_sync_array_size set to 8 and 64. In both cases, I got a significant
> improvement: TPS increased to 5900, about twice the value with
> innodb_sync_array_size set to the default of 1.
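
The general idea of splitting one global wait array into several independently
locked instances can be sketched as below. This is only an illustration of lock
striping under assumed names (StripedWaitArray, reserve_cell, free_cell); how
the server actually picks an instance is not described in this thread, so the
modulo choice here is an assumption.

```python
import threading


class StripedWaitArray:
    """Hypothetical wait array split into independently locked stripes."""

    def __init__(self, stripes: int, cells_per_stripe: int = 4):
        if stripes < 1:
            raise ValueError("need at least one stripe")
        # One lock and one free-cell list per stripe, instead of one global pair.
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._free = [list(range(cells_per_stripe)) for _ in range(stripes)]

    def reserve_cell(self, thread_key: int):
        """Reserve a wait cell; only the chosen stripe's lock is taken."""
        stripe = thread_key % len(self._locks)  # assumed selection rule
        with self._locks[stripe]:
            if not self._free[stripe]:
                raise RuntimeError("no free cell in stripe %d" % stripe)
            cell = self._free[stripe].pop()
        return stripe, cell

    def free_cell(self, stripe: int, cell: int) -> None:
        """Return a cell to its stripe."""
        with self._locks[stripe]:
            self._free[stripe].append(cell)


if __name__ == "__main__":
    array = StripedWaitArray(stripes=8)
    stripe, cell = array.reserve_cell(thread_key=13)
    print(stripe, cell)
    array.free_cell(stripe, cell)
```

With a single stripe every waiter serialises on the same lock; with 8 stripes
and evenly spread keys, each lock is shared by roughly one eighth of the
threads.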
> 
> So it is clear that the main limitation in this benchmark was the non-scalable
> InnoDB synchronisation implementation. After tuning innodb_sync_array_size,
> time spent in _raw_spin_lock() is down to half what it was before (33% of
> total time):
> 
>   +  33.77%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock
> 
> Investigating the call graphs now shows that the sync_array operations are much
> less visible. Instead mutex_create_func(), called from
> dict_mem_table_create(), is the one that turns up prominently in the profile.
> I am not familiar with what this part of the InnoDB code is doing, but what I
> saw from a quick look is that it creates a mutex - and there is another global
> mutex needed for this, which again limits scalability.
> 
> It is a bit surprising to see mutex creation being the most significant
> bottleneck in the benchmark. I would have assumed that most mutexes could be
> created up-front and re-used? It is possible that this is a warm-up thing,
> maybe the code is filling up the buffer pool or some table-cache like thing
> inside InnoDB? Because I see TPS being rather low for the first 150 seconds of
> the run (around 3000), and then increasing suddenly to around 8000-9000 for
> the rest. This might be worth investigating further.
> 
> So in summary, my investigations found that the bottleneck in this benchmark,
> and the likely cause of the fall-over, is a scalability problem with InnoDB
> locking primitives. The sync_array part seems to be mitigated to some degree
> by innodb_sync_array_size; the mutex creation part still needs to be
> investigated.
> 
> I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does
> anyone know? I vaguely recall reading something about it, but I am not sure.
> It would seem a waste to duplicate their efforts.
> 
> In any case, I hope this was useful. As part of this investigation, I
> installed a new 3.14 kernel on the lizard2 machine and a new `perf`
> installation, which seems to work well for more detailed investigations of
> these kinds of issues. So let me know if there are other benchmarks that I
> should look into. One thing that could be interesting is to look for false
> sharing; the Intel manuals describe some performance counters that can be
> used for this.
> 
> As an aside: In my tests, once concurrency becomes high enough that the server
> falls over, the actual TPS number becomes mostly meaningless. E.g. I saw that
> putting dummy pause loops into the code increased TPS. If TPS stabilises at
> N% of peak throughput as concurrency goes to infinity, then we can compare
> N. But if N goes to zero as concurrency goes to infinity, I think it is
> meaningless to compare actual TPS numbers - we should instead focus on
> removing the fall-over behaviour.
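
The "N% of peak" figure suggested above is simple to compute from benchmark
output. A minimal sketch, using the approximate read-write numbers quoted
earlier in this mail (peak of about 13000 TPS, 2600 TPS at 512 threads, 5900
TPS after tuning innodb_sync_array_size); the function name is made up for
illustration.

```python
def percent_of_peak(tps_at_high_concurrency: float, peak_tps: float) -> float:
    """Return high-concurrency throughput as a percentage of peak throughput."""
    if peak_tps <= 0:
        raise ValueError("peak_tps must be positive")
    return 100.0 * tps_at_high_concurrency / peak_tps


if __name__ == "__main__":
    # Approximate figures from the read-write runs described in this thread.
    print(round(percent_of_peak(2600, 13000), 1))  # default innodb_sync_array_size
    print(round(percent_of_peak(5900, 13000), 1))  # innodb_sync_array_size 8 or 64
```

On those numbers the server keeps about 20% of peak at 512 threads by default
and about 45% after tuning. Whether either value is a stable plateau or still
falling at higher thread counts would need runs beyond 512 threads.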
> 
> (Maybe this is already obvious to you, I have not followed the previous
> benchmark efforts that closely).
> 
> Hope this helps,
> 
>  - Kristian.