
maria-developers team mailing list archive

Re: [Maria-discuss] Pipeline-style slave parallel applier


andrei.elkin@xxxxxxxxxx writes:

> I am yet to check closer thd_rpl_deadlock_check(), perhaps that's why


I definitely think you should get well acquainted with existing parallel
replication in MariaDB to undertake a project as ambitious as this one.
There are a number of central concepts and algorithms that work together in
synergy, including (off the top of my head):

 - binlog group commit, which uses the commit_ordered mechanism to
   efficiently perform commits for multiple threads and coordinate with
   storage engines.

 - wait_for_prior_commit, which works with the binlog group commit to
   scalably handle dependencies between replication workers.

 - thd_rpl_deadlock_check(), which works with supporting storage engines to
   detect conflicts between transactions replicated in parallel.

 - rpl_parallel_entry and group_commit_orderer, which efficiently
   coordinate scheduling between worker threads (Jean François' benchmarks
   show this scaling to 5000 threads).

 - And behind it all the strict ordering of commits, which underlies many
   assumptions around parallel replication as well as GTID and allows things
   like optimistic/aggressive parallel replication in a way that is
   completely transparent to applications.
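
The commit-ordering idea behind wait_for_prior_commit can be sketched in a
toy model (names and structure are illustrative only, not MariaDB's actual
API): workers apply their transactions in parallel, but each must wait until
all earlier transactions have committed before committing itself.

```python
import threading

class CommitOrderer:
    """Toy model of wait_for_prior_commit: worker N may only commit
    after worker N-1 has committed, preserving binlog order."""
    def __init__(self):
        self.cond = threading.Condition()
        self.next_to_commit = 0

    def wait_for_prior_commit(self, seq_no):
        with self.cond:
            while self.next_to_commit != seq_no:
                self.cond.wait()

    def commit_done(self, seq_no):
        with self.cond:
            self.next_to_commit = seq_no + 1
            self.cond.notify_all()

order = []
co = CommitOrderer()

def worker(seq_no):
    # ... the transaction's events would be applied in parallel here ...
    co.wait_for_prior_commit(seq_no)
    order.append(seq_no)       # "commit" in binlog order
    co.commit_done(seq_no)

# Start workers out of order; commits still happen in sequence.
threads = [threading.Thread(target=worker, args=(i,)) for i in (2, 0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(order)  # always [0, 1, 2] regardless of thread start order
```

The point of the real mechanism is that this waiting is integrated with
binlog group commit, so ordered commits still batch efficiently.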

One thing about the pipeline-style slave parallel applier is still not clear
to me: how will conflicts between individual statements be handled? E.g.
suppose we have this group commit (epoch) on the master:

  BEGIN;  /* T1 */
  INSERT INTO t1 VALUES  (...);  /* S1 */
  BEGIN;  /* T2 */
  UPDATE t1 SET b=10 WHERE a=1;  /* S2 */
  UPDATE t1 SET b=20 WHERE a=1;  /* S3 */

So the idea is to schedule these two transactions (T1 T2) to three different
worker threads, right? Each thread gets one of (S1 S2 S3).
But we must not commit S3 before S2, obviously, or we end up with incorrect
result. So how will this dependency be handled?
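
Just to spell out why the order matters, here is a toy model of the two
updates (the dict-based "rows" are purely illustrative): applying S3 before
S2 leaves the row in a state the master never had.

```python
# Toy illustration of the S2/S3 dependency: both statements update the
# same row (a=1), so their apply order determines the final state.
def apply(statements, row):
    for stmt in statements:
        row.update(stmt)
    return row

s2 = {"b": 10}   # UPDATE t1 SET b=10 WHERE a=1
s3 = {"b": 20}   # UPDATE t1 SET b=20 WHERE a=1

correct = apply([s2, s3], {"a": 1})   # master order: b ends as 20
wrong   = apply([s3, s2], {"a": 1})   # commit S3 before S2: b ends as 10
print(correct["b"], wrong["b"])       # 20 10
```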

I was assuming that this would be using the same mechanism as optimistic
parallel replication - in fact, I thought the point was that this is
precisely a clever generalisation of optimistic parallel replication,
realising that transaction boundaries need not be respected. And since epoch
does not help with the dependency between S2 and S3, I was confused about the
point of using it to detect the lack of dependency between T1 and T2...

But if you say that you had not considered using the thd_rpl_deadlock_check
mechanism, then I am confused as to how to detect that S3 needs to wait for
S2.
> Assume out-of-order commit of the grouped original transactions:
> GTID gaps would exist for some little time, but as long as there are no
> crashes, slave conservative execution (respecting the group boundaries,
> that is) would fill them;
> and in case of crash recovery, the slave could specify
> the gaps at reconnecting (unless those transactions are in the relay-log)
> for retransmitting.
> They would be logged out-of-order in the log-bin case, and may appear
> as a set (mathematically). However it's a range as the union
> of gtids, and it - the master range - can be
> preserved as described along the replication chain.

I don't understand how MariaDB GTID can work unless the slave binlogs the
GTIDs in the exact same order as the master. At any time, a lower-level
slave may connect, and you have to find the right point at which to start
sending it events. There is only a single GTID to start from.

Remember, it is not guaranteed that the GTIDs are in strict sequence number
ordering from the master! Things like master-master ring or transactions
executed directly on the slave can break ordering when gtid_strict_mode is
not set. Suppose you have GTIDs 0-1-100, 0-1-99 in the log and a lower-level
slave connects at GTID 0-1-100. How will you know whether you need to send
0-1-99 to the slave (because of out-of-order master) or if you need to skip
it (because we did parallel execution out-of-order)?
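
A toy model of the ambiguity (the binlog and lookup rule are deliberately
naive and hypothetical): given only the binlog contents, the obvious rule
"start after the slave's position" cannot distinguish the two cases.

```python
# Toy binlog with out-of-order GTIDs, as (domain, server, seq_no) tuples.
# A lower-level slave connects saying "I have up to 0-1-100".
binlog = [(0, 1, 100), (0, 1, 99)]

def events_after(binlog, slave_pos):
    """Naive rule: start right after the first occurrence of slave_pos."""
    idx = binlog.index(slave_pos)
    return binlog[idx + 1:]

print(events_after(binlog, (0, 1, 100)))
# -> [(0, 1, 99)]: but should 0-1-99 be sent (out-of-order master) or
#    skipped (this slave committed it out of order)? The binlog alone
#    cannot tell these two cases apart.
```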

Or if binlogging is changed so that it uses the slave relay logs instead
(identical to master binlogs), but replicated transactions are committed to
storage engines out of order. Then how do we remember which parts are
committed in storage engines, so we can recover from a crash?

In current MariaDB parallel replication, the enforced commit ordering
ensures that a single GTID position is enough to recover the slave from a
crash, as well as to connect a lower-level slave. It also guarantees to
users that they can enable parallel replication without introducing any new
behaviour visible to applications.

> Right, those 40% of rollbacks we are going to trade with extra
> productivity :-).

But the point is - those 40% rollbacks were beneficial. The main cost of the
transaction was the I/O to read affected pages from disk - so running
future transactions in parallel is a win even if we know that they need to
be rolled back and retried later. I/O latency is perhaps the most important
slave bottleneck to address with parallel replication, and speculative apply
can work as a pre-fetcher.
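
A toy cost model of the pre-fetching effect (costs and page names are made
up for illustration): a speculative first run that gets rolled back still
warms the buffer pool, so the retry is cheap.

```python
# Toy cost model: reading a cold page costs 10 units (I/O), a cached
# page costs 1 unit. Even a run that must be rolled back pays the I/O
# cost once, making the serial retry nearly free.
def run(txn_pages, cache):
    cost = sum(10 if p not in cache else 1 for p in txn_pages)
    cache.update(txn_pages)  # pages stay cached after rollback
    return cost

cache = set()
first_try = run({"p1", "p2"}, cache)   # cold pages: costs 20
retry     = run({"p1", "p2"}, cache)   # warm after rollback: costs 2
print(first_try, retry)                # 20 2
```

The speculative run's I/O can overlap with other transactions' work, which
is where the win over strictly serial apply comes from.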

> Very true. On one hand we can go blind "speculatively" direct ready to
> pay rollback price, on the other already Master does detect (some) conflicts, and
> the actual conflict-free range of transactions is wider than a mere one
> binlog group (mind a reference to the logical clock that can "measure" dependencies).

Can you btw. point me to a detailed description of how logical clock works?
My impression is that this still has limited scope for parallelisation if
the master has fast commits, but not knowing the details I may be missing
something.

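For what it's worth, my rough understanding of the scheme (as in MySQL's
LOGICAL_CLOCK dependency tracking; the field names below are illustrative,
and this is a simplification) is that each transaction carries a
(last_committed, sequence_number) pair, and two transactions may apply in
parallel when neither's interval excludes the other:

```python
# Simplified sketch of logical-clock dependency metadata. A later
# transaction may run alongside an earlier one iff it was prepared
# before the earlier one committed on the master.
def concurrent(earlier, later):
    return later["last_committed"] < earlier["sequence_number"]

t1 = {"last_committed": 0, "sequence_number": 1}
t2 = {"last_committed": 0, "sequence_number": 2}  # same commit group as t1
t3 = {"last_committed": 2, "sequence_number": 3}  # depends on t2

print(concurrent(t1, t2), concurrent(t2, t3))  # True False
```

Under fast commits the commit groups shrink, so last_committed values trail
sequence numbers closely and little parallelism is exposed - which is the
limitation I suspect above.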
> Although the assignment role of the producer is clearly better off be
> taken by the workers. What I meant in the original mail
> that the assignment mechanism can be improved to avoid any chance of
> worker idling but more importantly to facilitate fairness.
> The workers go to pick pieces of work at once they become
> available. The optimistic scheduler assignment is close to that but
> it still assigns only round robin that is speculatively.

I don't understand "the assignment role of the producer is clearly better
off be taken by the workers". How would this reduce workers idling? I do not
think the current parallel replication scheduler leaves workers idle?

On the other hand, if workers are to individually pull work from a shared
queue, then they will all be contending for access to the queue (imagine
5000 threads doing this at once...). The idea with the existing scheduler is
that this does not happen - each worker has its own queue, so contention is
avoided.
(Just to be clear, I'm trying to understand here; the current MariaDB
scheduler is obviously not perfect.)

>> In fact, if the existing mechanism can be slightly extended, maybe it can
>> already do much of what is needed to ensure consistency. Eg. suppose we have
>> 3 original transactions each with two statements:
>>   e^1_1 e^1_2  e^2_1 e^2_2  e^3_1 e^3_2
>> Suppose the first 3 of these are ready to commit at some point:
>>   e^1_1 e^1_2 e^2_1
>> The leader can check that it doesn't try to commit only part of an original
>> transaction.
> Probably I will correct your initial perception here (otherwise I am
> lost in how e^2_1 got separated).
> In this case you mean three (original) T:s (E) are mapped into a modified
> T'_1 = {e^1_1 e^1_2 e^2_1}
> and some more e.g just one T'_2 unions of
>  T'_2 \cup T'_1 =  T_1 \cup T_2 \cup T_3  =  E
> As the leader can't initiate 2p-commit without T'_2.

To illustrate what I mean, take my earlier example, with two transactions T1
and T2 containing individual statements S1, S2, and S3.

My understanding is that on the slave, we will start 3 transactions for S1,
S2, and S3, each running in its own worker threads. Right?

So suppose S1 and S2 finish and are ready to complete, but S3 is still
running. Now, we cannot just commit S1 and S2, as that would leave us in an
inconsistent state (in case of a crash, for example). So we should either
commit just S1 at this point, or wait for S3 and then group commit all of
S1, S2, and S3 together. Right?
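
The check the leader would need can be sketched as follows (the transaction
mapping and function are hypothetical, just formalising the example): among
the statements ready to commit, only those whose whole original transaction
is ready may be included in the group commit.

```python
# Toy model of a leader picking a committable subset: never commit a
# partial original transaction (here T2 = {S2, S3} must stay together).
txn_of = {"S1": "T1", "S2": "T2", "S3": "T2"}
members = {"T1": {"S1"}, "T2": {"S2", "S3"}}

def committable(ready):
    ready = set(ready)
    # keep only statements whose entire original transaction is ready
    return {s for s in ready if members[txn_of[s]] <= ready}

print(sorted(committable({"S1", "S2"})))        # ['S1']  (S3 still running)
print(sorted(committable({"S1", "S2", "S3"})))  # ['S1', 'S2', 'S3']
```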

Or is there some mechanism by which two different threads could run S2 and
S3 in parallel, but still within a single transaction (inside InnoDB, e.g.?)
- I imagine this is not the case.

> Sorry, previously I omitted one important detail. A Worker
> pulls from the shared queue its tail statement not only for itself.
> When the pull result is a dependent statement whose parent has been or is in
> processing by another worker, that statement will be redirected to that
> worker.

Right - but how will you determine whether there is such a dependency? It
seems very hard to do in the general case at pull time (where we haven't
even parsed the statement yet)?
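
For row-based events the affected keys are known without parsing SQL, so
something like writeset hashing could work there; a toy sketch of the
redirect rule (all names and the key scheme are hypothetical):

```python
# Sketch of writeset-style dependency detection at pull time: redirect
# an event to whichever worker last touched one of its keys.
last_writer = {}   # key -> worker that last wrote it

def assign(event_keys, pulling_worker):
    for k in event_keys:
        if k in last_writer:
            target = last_writer[k]   # conflict: redirect to that worker
            break
    else:
        target = pulling_worker       # no conflict: keep it ourselves
    for k in event_keys:
        last_writer[k] = target
    return target

print(assign({("t1", 1)}, "w1"))  # w1: no prior writer of row a=1
print(assign({("t1", 1)}, "w2"))  # w1: row a=1 conflicts, redirected
print(assign({("t1", 2)}, "w2"))  # w2: independent row
```

For statement-based events, though, the keys are not known before parsing,
which is exactly the difficulty raised above.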

> With some efforts we can make use of relay-log for both recovery and
> as binlog for further replication. In above I hinted at the latter.
> So why won't the slave worker "appropriate" the master binlog images? :-)
> While the master and slave server version are the same this optimization
> must be feasible.

Aha, I see. This is something like a built-in binlog server.

I agree this seems feasible. Surely it is a separate project from this (of
course one may benefit from the other).

In a sense, this is the "natural" way to do things. Why go to all the
trouble of trying to re-produce the original binlog from executing
individual statements, when we already have an (almost) copy directly from
the master?

Still, this may not be as easy as that... for example, how do we interleave
the multiple binlogs from multi-source replication in a way that is
consistent with the actual execution order? And what about extra
transactions executed directly on the slave?


 - Kristian.
