
maria-developers team mailing list archive

Re: MariaDB multi-source replication testing at Booking.com

 

Karoly Nagy <karoly.nagy@xxxxxxxxxxx> writes:

> We're seeing very high and fluctuating mutex contentions while
> replicating from two sources (Oracle MySQL 5.6) to a single MariaDB
> slave. You can see that on the graphs below. The spin waits are
> relatively [1] aligned but the mutex rounds [2] are 5-10 times higher
> than it is on the two sources combined together and not consistent.
> The sources have a relatively constant pattern while the target has
> dips around 2.5k and spikes up to 8k. The os waits are in completely
> different order of magnitude [3].
>
> The scenario where values were captured:
>
> * Multi-source target is replicating the full dataset of `source 2`
>   and a subset of `source 1` (the hot data) - MariaDB 10.0.16
> * Both sources are MySQL 5.6 being part of their replication chain as
>   slaves with log_slave_updates
> * Source 2 is in normal mode - Oracle MySQL 5.6.17
> * Source 1 is catching up from a 1 day replication delay - Oracle
>   MySQL 5.6.24
> * All the slaves are warm having the buffer pool fully populated 
>
> Is this behavior expected?

So if I understand correctly, what is compared here is the value of some
InnoDB statistics between two MySQL 5.6 servers each running a single
replication SQL thread, and a MariaDB 10.0 server running two replication
SQL threads (multi-source replication).

I do not have much experience interpreting InnoDB mutex wait statistics;
hopefully someone with more experience on this can contribute. But it does
seem somewhat expected that a server running two threads has a much higher
potential for mutex contention (mutex rounds and OS waits) than a server
running only a single thread, right?

Did you try comparing the numbers when only one thread is running on the
MariaDB slave (e.g. by first stopping one of the multi-source connections,
then the other)?

Did you try comparing the configurations of the three servers for any
relevant differences?
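
One way to make such a comparison mechanical is to dump the server variables
(for example from SHOW GLOBAL VARIABLES) on each server and diff the results.
The following is only a minimal sketch in Python, assuming the variables have
already been loaded into dictionaries; the function name diff_variables, the
server labels, and the sample values are hypothetical and not taken from the
servers discussed here.

```python
def diff_variables(servers):
    """Return {variable: {server: value}} for variables whose values differ.

    `servers` maps a server label to a dict of variable name -> value,
    e.g. as loaded from the output of SHOW GLOBAL VARIABLES.
    A variable that is missing on a server is reported with the value None.
    """
    # Collect every variable name seen on any server.
    all_names = set()
    for variables in servers.values():
        all_names.update(variables)

    differences = {}
    for name in sorted(all_names):
        values = {label: variables.get(name)
                  for label, variables in servers.items()}
        # Keep the variable only if at least two servers disagree.
        if len(set(values.values())) > 1:
            differences[name] = values
    return differences


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = {
        "source1": {"innodb_sync_spin_loops": "30",
                    "innodb_spin_wait_delay": "6"},
        "source2": {"innodb_sync_spin_loops": "30",
                    "innodb_spin_wait_delay": "6"},
        "target": {"innodb_sync_spin_loops": "30",
                   "innodb_spin_wait_delay": "12"},
    }
    for name, values in diff_variables(sample).items():
        print(name, values)
```

Variables that exist on only one of the server flavours show up with None for
the others, which may itself be a useful hint when comparing MySQL 5.6 with
MariaDB 10.0.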

What are the corresponding statistics on the original masters generating the
load?

Did you try to determine which individual mutexes contribute most to the
differences? (The total number of mutex waits is a rather crude statistic
that can be hard to interpret.)
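
A per-mutex breakdown could be obtained by taking two snapshots of SHOW ENGINE
INNODB MUTEX some time apart and ranking the mutexes by how much their wait
counts grew. The following is only a rough sketch in Python, assuming each
snapshot is a list of (Type, Name, Status) rows and that the Status column
contains a counter of the form os_waits=N; the exact output format may differ
between server versions, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed Status format: a string containing "os_waits=<number>".
_OS_WAITS = re.compile(r"os_waits=(\d+)")


def os_waits_by_mutex(rows):
    """Sum os_waits per mutex name from (type, name, status) rows."""
    totals = Counter()
    for _type, name, status in rows:
        match = _OS_WAITS.search(status)
        if match:
            # The same name can appear in several rows; add them up.
            totals[name] += int(match.group(1))
    return totals


def top_contributors(before, after, limit=10):
    """Return (name, growth) pairs for the mutexes whose os_waits grew most
    between the `before` and `after` snapshots, largest growth first."""
    start = os_waits_by_mutex(before)
    end = os_waits_by_mutex(after)
    growth = {name: end[name] - start.get(name, 0) for name in end}
    # Sort by descending growth, then by name for a stable order.
    ranked = sorted(growth.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

Running this once per server over the same time interval would show whether
the extra waits on the multi-source slave are concentrated in a few mutexes
or spread evenly.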

Do you have any indication that these differences are causing problems with
performance, or are you just curious to understand them?

Hope this helps,

 - Kristian.

