maria-developers team mailing list archive

Thread
Date
WL#186 New (by Knielsen): Parallel replication using interleaved binlog events to detect potential parallelism

To: maria-developers@xxxxxxxxxxxxxxxxxxx
From: worklog-noreply@xxxxxxxxxxxx
Date: Mon, 21 Mar 2011 10:45:53 +0000 (UTC)
-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Parallel replication using interleaved binlog events to detect
		potential parallelism
CREATION DATE..: Mon, 21 Mar 2011, 10:45
SUPERVISOR.....: 
IMPLEMENTOR....: 
COPIES TO......: 
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 186 (http://askmonty.org/worklog/?tid=186)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:



DESCRIPTION:

Similar to MWL#169 and MWL#184, this worklog describes a way to discover
binlog events that can be executed in parallel on the slave, improving
scalability with increasing CPU and I/O resources.

This worklog only deals with parallel replication of row-based
events. Statement-based events should still be executed in serial order.
We also require that commits on the slave still happen in the same order as on
the master (this guarantees that applications can never see a state on a slave
that did not exist on the master, among other nice properties).


Basic idea
----------

The idea is to work with row-based events. Suppose that on the master, we have
two parallel transactions running, T_1 and T_2. Suppose that we have the
following sequence of events:

1. T_2 modifies a row R_1, resulting in row event E_1

2. T_1 commits

3. T_2 commits.

In this case, it is safe on a slave to apply the event E_1 of T_2 in parallel
with any events of transaction T_1. Since we know T_1 committed before T_2, we
know that T_1 did not have to wait for T_2 to release any row lock of R_1 in
commit, so T_1 could not have modified R_1 after T_2 did. Similarly, since we
know that T_2 executed E_1 before T_1 committed, we know E_1 did not have to
wait for T_1 to release any row locks on R_1, so T_1 could not have modified
R_1 before T_2 did. Thus, we know that T_1 and T_2 are not in conflict on this
particular row R_1 and row event E_1.

Thus, on a slave, with this knowledge we can start transaction T_2 and execute
E_1 (in a separate thread from T_1 events) while T_1 is still running.


Refinement: optimistic apply
----------------------------

Suppose now that instead we have the following sequence of events:

1. T_3 commits

2. T_4 modifies a row R_2, resulting in row event E_2

3. T_4 commits

In this case, we can not be sure that event E_2 is independent of transaction
T_3. It could well be that T_3 also modified R_2, and E_2 was waiting for T_3
to commit and release row locks.

However, on a slave, we can nevertheless try to apply E_2 (in a separate
thread) while T_3 is still running. However, then we have to deal with the
situation that T_3 gets stuck waiting for T_4 to commit and release the lock
on R_2.

In this situation, we have a deadlock; T3 is waiting for T_4, but T_4 also
needs to wait for T_3 (as we want to commit in the right order, and even more
so want the end result to have the changes of T_4 in row R_2, not those of
T_3). So we need to abort and restart one of the transactions. As long as we
choose always to abort the transaction with the later commit time (T_4 in this
case), things will work out: We will abort T_4, commit T_3, then re-start and
commit T_4.

Using optimistic apply has the potential to increase parallelism on the
slave. However, this must be carefully weighted against the extra cost of
having to roll back and retry some transactions. So it needs suitable
status variables to be able to monitor number of transactions and number of
events that were rolled back and re-started (as well as number of transactions
and events successfully applied). And user options to control if and to what
extend to try optimistic apply.

But note that since we always have at least one transaction running that will
not be aborted due to optimistic apply (the one with earliest commit), and
since all other transactions run in parallel, the cost of restart may not be
too bad, as long as we do not start to starve the CPU(s). Any extra I/O reads
done in the aborted transaction to bring in pages to the buffer pool will in
any case benefit us when we immediately re-start the transaction.


Requirements
------------

Implementing this worklog will require several non-trivial extensions of the
server and storage engine code (these should eventually be filed as separate
worklog items):

1. Ability for server layer to detect and control what to do in case of
conflicts between two transactions. Currently, the storage engine
(eg. InnoDB/XtraDB) will just let the latter access block waiting for the
first transaction to commit, and in the case of a deadlock the storage engine
will choose an arbitrary transaction to roll back. For this worklog, the
server layer (replication slave SQL thread(s)) must be able to (a) cause abort
of the former transaction rather than wait of the second, and (b) choose which
of several possible transactions to abort in case of deadlock. Note that such
a mechanism is also needed for Galera replication, and the Galera wsrep patch
implements something like this for InnoDB already.


2. Interleaving of events from different transactions in the
binlog. Currently, the master buffers events during a transaction and writes
them all out sequentially during COMMIT; this destroys any knowledge about
which events and transactions ran in parallel on the master. We need to change
this so that events are tagged with a (local) transaction ID and written
out to the binlog as they occur, interleaved with events from other
transactions. We will probably also want to add "begin transaction" events and
"prepare" event, to increase the knowledge available on the slave about in
which order events occured.

We will still want to do _some_ buffering of events to avoid heavy contention
on writing to the binlog. We can buffer a small number of normal events; The
cost of such buffering is we may write a row event E_r _after_ a commit event
E_c where it could have been written before, which loses some opportunity for
(non-optimistic) parallel apply on the slave. We may however _not_ buffer
commit events; such buffering could wrongly put E_r ahead of E_c where it must
be after, causing an unexpected conflict in parallel apply on the slave.


3. A mechanism on the slave for distributing binlog events among multiple
threads and coordinating execution and commit among the threads. Such
mechanism should ideally be the same as used in MWL#184 and MWL#169.


Discussion
----------

One nice property of this worklog is that the parallelism exposed to the slave
is derived from whatever parallelism was there already on the master. We know
E_1 in T_2 ran in parallel with T_1 on the master, and use this to similarly
run E_1 in parallel with T_1 on the slave. This increases the likelyhood that
if parallelism is available on the master, it will also be available on the
slave (note that we will need parallelism on both master and slave to be able
to scale an entire replication topology).

----

For short OLTP-like transactions, it seems likely that this method will be
particularly suited to be combined with the method from MWL#184, which allows
to run in parallel on the slave all transactions that participate in a single
group commit on the master.

Eg. for single-row transactions T_1 (with row event E_1) and T_2 (with row
event E_2) that run in parallel on the master, we can have the following two
situations (among others):

1. E_1 and E_2 execute start            1. E_1 and E_2 execute start

2. E_1 execute finish                   2. E_1 execute finish

3. E_2 execute finish                   3. T_1 commit

4. T_1 commit                           4. E_2 execute finish

5. T_2 commit                           5. T_2 commit

In the scenario to the left, on a busy master the two commits are likely to be
done as a single group commit, and MWL#184 will allow parallelism on the
slave. In the scenario on the right, optimistic apply as per this worklog will
allow to run T_1 and T_2 in parallel on the slave.

----

Note that by interleaving events from different transactions in the binlog, we
will be writing events before they are committed, and so will need to issue
ROLLBACK statements in case of rollback.

Similarly, on the slave, we can start applying the events from a transaction
even before we have received the commit event from the master (in fact even
before such commit event has even occured on the master). This can potentially
reduce master->slave delay.

However, there is a need to be careful, as on the slave we need to keep the
correct commit order. And such order is not known until the commit events are
received. For example, if we had a limited number N of slave execution threads
and they all have started running T_1, ..., T_N, and we then receive a commit
event for T_(N+1), we would be deadlocked. It might be a good idea to only
start executing new transactions when we have at least one pending commit in
the relay log, or at least keep one spare thread available for execution of
one more transaction once we receive such commit event.

Similarly, if we get a conflict in opportunistic apply of transactions T_1 and
T_2, neither of which we have received the commit event from, then we will
need to wait for either commit event to arrive before we can decide which of
T_1 and T_2 to abort.

Note that in any case what we need to do is no worse than what we do in
currentl MySQL replication. There, we never do any execution until we have not
only seen but also executed any prior commit. Similarly, whatever buffering of
events we need to do on the parallel slave will be no worse than what we
currently have to do on the master (where we buffer _all_ events until
commit).


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)