
maria-developers team mailing list archive

Status on MDEV-4506, parallel replication


So I have been working for some weeks now on implementing MDEV-4506. This task
is about making the slave apply (some) events in parallel threads, to speed up
replication and reduce the risk of the slave not being able to keep up with a
busy master.

Events are applied in parallel on the slave if they were group-committed
together on the master. This is an easy way to detect transactions that are
independent. Note that this is transparent to applications; while transactions
are executed in parallel on the slave, they are still committed in the same
order as on the master.
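This ordering constraint can be shown with a toy model (illustrative Python
only, not the actual server code; the `OrderedCommitter` name is made up):
workers may apply transactions in any order, but each waits for its
predecessor before committing, so the commit order matches the master.

```python
# Illustrative sketch only: parallel apply with in-order commit.
import threading

class OrderedCommitter:
    def __init__(self):
        self.next_seq = 0                  # sequence number allowed to commit next
        self.cond = threading.Condition()

    def commit(self, seq, txn, log):
        with self.cond:
            while seq != self.next_seq:    # wait for all earlier transactions
                self.cond.wait()
            log.append(txn)                # "commit": record in master order
            self.next_seq += 1
            self.cond.notify_all()

committer = OrderedCommitter()
log = []
# Start the workers deliberately out of order: 2, 0, 1.
workers = [threading.Thread(target=committer.commit, args=(seq, "txn%d" % seq, log))
           for seq in (2, 0, 1)]
for w in workers: w.start()
for w in workers: w.join()
print(log)  # ['txn0', 'txn1', 'txn2'] -- same commit order as the master
```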

I also added parallel execution of events with different GTID domain id. This
makes testing a lot easier (no need to carefully arrange timing to get a
specific group commit on the master), and also really is the whole point of
much of my hard work on GTID. So if we have multi-source M1->S1, M2->S1, and
S1->S2, then S2 will be able to execute events from M1 in parallel with those
from M2, just like S1 can. And the user can explicitly set a different
domain_id for eg. a long-running ALTER or UPDATE, and this way get it to run
in parallel without causing a huge replication delay.
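For example (the table and statement here are hypothetical; gtid_domain_id is
the MariaDB session variable for this), a long-running statement can be given
its own domain on the master:

```sql
SET SESSION gtid_domain_id = 2;   -- own replication domain for the slow DDL
ALTER TABLE big_table ADD INDEX idx_b (b);
SET SESSION gtid_domain_id = 0;   -- back to the default domain
```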


On the master, I added to each GTID event a commit_id. If two transactions
group-commit together, they are binlogged with the same commit_id; if not,
they get different commit_ids. Thus, the slave can detect the possibility of
executing two transactions in parallel by checking if the commit_ids are
equal.

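In pseudo-code terms the scheduling decision is simple; here is a sketch (the
event representation is made up, not the real binlog format) that splits a
GTID stream into batches the slave may run in parallel:

```python
# Sketch: events with equal commit_id were group-committed together on the
# master, so the slave may apply each batch in parallel; a new commit_id
# means the batch must wait for the previous one to finish.
def split_batches(events):
    """events: list of (gtid, commit_id) pairs in binlog order."""
    batches = []
    last_commit_id = None
    for gtid, commit_id in events:
        if batches and commit_id == last_commit_id:
            batches[-1].append(gtid)     # same group commit: parallel-safe
        else:
            batches.append([gtid])       # new group commit: new batch
        last_commit_id = commit_id
    return batches

stream = [("0-1-1", 10), ("0-1-2", 10), ("0-1-3", 11), ("0-1-4", 11), ("0-1-5", 12)]
print(split_batches(stream))
# [['0-1-1', '0-1-2'], ['0-1-3', '0-1-4'], ['0-1-5']]
```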
On the master, I implemented --binlog-commit-wait-count=N and
--binlog-commit-wait-usec=T. A transaction will wait at most T microseconds
for at least N transactions to queue up and be ready for group commit. This
makes it possible to deliberately delay transactions on the master in order to
get bigger group commits and thus better opportunity for parallel execution
(and again it makes testing easier).

On the slave, I implemented --slave-parallel-threads=N. If N>0, that many
threads will be spawned, and events will be executed in parallel (if possible)
by those threads.
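Put together, the two sides of the setup look like this in my.cnf (using the
same values as in the benchmark below):

```
# Master: wait up to 1 second for up to 20 transactions per group commit.
[mysqld]
binlog-commit-wait-count=20
binlog-commit-wait-usec=1000000

# Slave: spawn 25 threads for parallel apply.
[mysqld]
slave-parallel-threads=25
```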


The current code is pushed here (it is based on 10.0-base):


It is still far from finished, but it now works sufficiently that I could do
some basic benchmarking (just on my laptop, completely unscientifically).

First, I prepared a binlog on the master with plenty of opportunity for
parallel replication. I started the master with --binlog-commit-wait-count=20
--binlog-commit-wait-usec=1000000. I then ran this Gypsy script:

p|1|REPLACE INTO t1 (a,b) VALUES (? MOD 10000, ?)|int,varchar /home/knielsen/my/gypsy/words

  gypsy --queryfile=simple_replace_load.gypsy --duration=20 --threads=40

This results in a binlog with about 65k updates to the table, group-committed
in batches of 20 transactions.

I then started a fresh slave and let it replicate everything with START SLAVE
UNTIL; the time to replicate all the events is then easy to see in the slave
error log.
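For reference, the procedure looks roughly like this (the binlog file name and
position are placeholders, taken from SHOW MASTER STATUS on the master):

```sql
-- On the master, note where the binlog ends:
SHOW MASTER STATUS;
-- On the fresh slave, replicate up to exactly that point:
START SLAVE UNTIL MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=123456;
```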

The time to replicate everything with unmodified 10.0-base was 99
seconds. With --slave-parallel-threads=25, it was just 22 seconds. So that is
a 4.5 times speedup, which is quite promising. Also note that at 22 seconds,
the slave is within 10% of the speed of the master.
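The arithmetic, for the record (a trivial check, runnable as-is):

```python
serial, parallel = 99.0, 22.0        # seconds, from the two slave runs
master_wall = 20.0                   # the gypsy load ran for 20 seconds
print(round(serial / parallel, 1))   # 4.5 -- the speedup factor
print(parallel / master_wall)        # 1.1 -- slave within 10% of the master
```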


But as I said, there is still significant work left to do. I put a ToDo list
at the top of sql/rpl_parallel.cc. Some of the big remaining issues:

1. The existing code is not thread-safe for class Relay_log_info. This class
contains a bunch of stuff that is specific to executed transactions, not
related to relay-log at all. This needs to be moved to the new struct
rpl_group_info I introduced, and all code updated to pass around a pointer to
that struct instead. There may also be a need to add additional locking on
Relay_log_info; the existing code needs review for this.

2. Error handling needs to be implemented; it is rather more complex in the
parallel case. If one transaction fails (and retry also fails), then we need
to somehow get hold of all later transactions that are in the process of
parallel execution, and abort them + roll them back. Otherwise we get an
inconsistent binlog position for the next slave restart.

3. In the old code, when the SQL thread is stopped, it has logic to let the
current event group (=transaction) replicate to completion first, with a
timeout to force a stop in the middle of the event group if eg. the master has
disappeared. This logic needs to be re-implemented to work when any number of
event groups are executing in parallel. (It is important to let the groups
complete execution when doing non-transactional stuff that cannot be rolled
back, otherwise again the slave position becomes inconsistent for the next
slave restart.)
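One possible shape for that logic (an illustrative Python sketch under the
assumption that workers register each in-flight event group; none of these
names exist in the server):

```python
# Sketch: wait for all in-flight event groups to complete, but force the
# stop after a timeout (eg. if the master has disappeared mid-group).
import threading, time

class ParallelStopper:
    def __init__(self):
        self.in_flight = 0
        self.cond = threading.Condition()

    def group_start(self):
        with self.cond:
            self.in_flight += 1

    def group_done(self):
        with self.cond:
            self.in_flight -= 1
            self.cond.notify_all()

    def stop(self, timeout):
        """True if all event groups completed cleanly, False if forced."""
        deadline = time.monotonic() + timeout
        with self.cond:
            while self.in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False        # timeout: stop mid-event-group
                self.cond.wait(remaining)
            return True

stopper = ParallelStopper()
stopper.group_start()
threading.Timer(0.05, stopper.group_done).start()   # group finishes shortly
print(stopper.stop(timeout=1.0))  # True -- clean stop before the deadline
```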

So as you see, there is quite a bit of work left on this (as well as on
GTID). So I would very much welcome any help on this to avoid causing delays
for 10.0-GA ...

 - Kristian.