maria-developers team mailing list archive

Thread
Date
Re: [Maria-discuss] Pipeline-stype slave parallel applier

To: andrei.elkin@xxxxxxxxxx
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Sun, 25 Nov 2018 18:08:47 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <875zwoc19n.fsf@quad> (andrei elkin's message of "Thu, 22 Nov 2018 22:20:36 +0200")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
andrei.elkin@xxxxxxxxxx writes:

> We can notice that in the pipeline method any attempt to start waiting for a granted lock
> to another branch needs interception and erroring out.
> We are bound to this because the grated lock is to be released only after 2pc.

So this is essentially what thd_rpl_deadlock_check() is used for. It allows
the parallel replication code to see all lock waits in storage engines.
Suppose T2 occurs after T1 in the binlog. If T2 waits for T1 that is fine,
T2 would in any case need to wait for T1 to commit first. But if T1 waits
for T2, that is not allowed, T1 must commit first. So we kill T2, allowing
T1 to continue. T2 is then re-tried once T1 committed.

So I think there might be a quick'n'dirty way to get a proof-of-concept of
the idea of replicating in parallel different statements in a single
transaction. In rpl_parallel::do_event(), just duplicate all (transactional)
GTID and commit/XID events. Then optimistic parallel replication will just
run all the statements individually in parallel, and in case of any
dependencies roll back and retry.

Though retry_event_group will need to be hacked as well so it knows to
re-try only its own single statement in an event group. And maybe something
else will break... but if it works, it could be a very quick way to do some
benchmarking on various workloads to get an idea of the potential benefit,
before starting to tackle all the difficult consistency requirements etc.


> Upon the statement errors out it will be reassigned
> to the lock grantee worker. If along this way the group still could not
> be finished, we could rollback all branches and turn to conservative
> scheduler, eventually to the sequential one.

So this is the idea that the scheduler will first analyse event groups for
dependencies, and then schedule them in a conflict-free way to worker
threads. The main problem to attack then becomes one of discovering
dependencies in a scalable way.

For a long time, this is also how I thought of the problem. It's not just
conservative mode, there were many ideas for doing this, some going back
more than 10 years. There were also patches for this, for example a very
interesting patch (not by me, should be in mailing list archive) that
analyzed row-events and scheduled conflicting transactions to the same
branch.

But eventually I've reached the conclusion that optimistic mode is just the
better approach. The main reason is that it provides very good opportunities
for parallel apply (and hence speedup), while _also_ providing very strong
consistency guarantees for existing (and new) applications. In more detail:


1. Optimistic replication has the best potential to discover opportunities
for parallel execution - it can run _anything_ in parallel that does not
conflict. In can even run many transactions in parallel that _do_ conflict!
Even if transactions T1 and T2 have a conflict, there is still a chance that
T1 will grab the conflicting lock first, and the first part of T2 can
successfully replicate in parallel with T1.

2. The in-order commit provides very strong and robust consistency
guarantees, both on data integrity (eg. in case of crash), and on
read-consistency (a query on slave will never see a state of data that did
not also exist on master). It is important not to under-estimate how
important this point can be for users to have confidence in using the
parallel replication.

3. When dependencies do prevent parallel execution of T1 and T2 (and we have
to kill and re-try T2), optimistic parallel replication still acts as a
pre-fetcher to speed up I/O-bound workloads. I/O latencies are so large as
to often dwarf any CPU bottleneck, and disk systems often require multiple
outstanding I/O requests to scale well.

4. Optimistic parallel replication re-uses existing mechanisms - it uses the
normal query execution path in storage engines to discover conflicts, and
normal thread kill/rollback is used to handle conflicts. It avoids all the
complexity of trying to duplicate the exact conflict semantics in the
dependency-checking algorithms. It avoids the need to be overly conservative
in the dependency-checking to ensure safety (eg. it trivially is able to
parallelise non-conflicting statement-based transactions on the same single
table). And in the dependency-checking approach, if you let just one corner
case through that you need to handle during execution - then you end up
implementing the full optimistic conflict handling _anyway_ (as I learned
the hard way :-).


There are a few limitations to optimistic parallel replication that I first
thought would be quite severe. But experience seems to show they are not;
here are my thoughts on what makes this so:

1. Optimistic parallel replication increases the rollback rate, sometimes
dramatically so. I think the point is that this rollback also happens in
parallel. Even a small increase (eg. 10%) in replication throughput can make
the difference between a slave that keeps up and one that lags behind, and
this is well worth the cost of increasing CPU usage with a core or two,
modern servers usually have plenty cores to spare. And rollback+retry
typically does not incur much extra I/O overhead (eg. pre-fetching).

2. In-order commits means that look-ahead past a long-running transaction is
limited by how many pending commits we can have outstanding. But as
Jean-François demonstrated in his benchmarks, optimistic parallel
replication can have thousands of outstanding commits and still scale. If
the slave is keeping up (and hopefully it is!), there probably are not
thousands of pending transactions to work in on parallel anyway. And even
approaches that do allow out-of-order commit probably need some limit on
look-ahead anyway (eg. I think MySQL uses a "checkpoint" mechanism.)

3. In-order parallel replication is limited to inter-transaction
parallelism, it cannot do intra-transaction parallel apply. Thus if the
workload mainly consists of large transactions with conflicts between them,
speedup is limited to prefetching. Well, this is exactly the limitation that
your idea is addressing! Which is one reason I found it so interesting to
learn about.


I really think parallel replication should *just*work* out-of-the-box for
the "normal" user. This is 2018, AMD server CPUs come with 32 cores, the
other day I saw a raspberry-pi-lookalike board with 6 cores. A decent,
robust implementation of parallel replication needs to be enabled by
default. The user shouldn't need to review all of their applications to
check if they can tolerate different read consistencies from on the master,
let alone re-write them to spread queries over multiple tables or databases.

The optimistic parallel replication is the first method I have seen that can
properly satisfy this - give decent speedup on most real-life workloads, in
a way that is transparent to applications. Users can just enable optimistic
parallel replication, and if it fails in some way, it is a bug that should
be fixed.

(How about enabling it by default in 10.4, BTW?)


Of course, 10.3 MariaDB parallel replication does not solve all problems,
far from it!

I agree that better dependency analysis in the scheduler can be usefully
combined with optimistic parallel replication, not just the conservative
mode.

For example, DDL currently stalls optimistic parallel replication
completely, and the user must manually set GTID domain id and control
dependencies to replicate DDL in parallel with other transactions.
Optimistic also does not do much for non-transactional DML. Improved
dependency tracking and scheduling could improve this.

Analysis could also determine events that are very likely to conflict, and
this way reduce unnecessary rollback. There is the "optimistic" vs.
"aggressive" --slave-parallel-mode distinction to enable/disable such
heuristics, but few are currently implemented.


> While the feature as you put in comments is largely about optimistic schedulers
> it still does apply to some corner cases in the conservative one.
> You mention 'different optimizer decision' which hints to SBR specifics.

Yes, the thd_rpl_deadlock_check() feature was originally implemented to fix
various corner cases in conservative parallel replication. I think this
mostly/only occurs when commit order on the slave is forced to be the same
as on master.

The "different optimiser decision" is the theoretical case where the
optimiser chooses to use an index on the master but a table scan on the
slave. Or chooses different indexes on master and slave (could also apply to
RBR perhaps). In this case different locks can be taken on master and slave
and it seems likely that order-conflicts could result.


> Here how it would go the simple way which must have sense 'cos just
> statically S2->S3 can't be of common pattern.

The concrete example was silly, of course, updating the same row twice in a
single transaction. But for example foreign key conflicts will be common.
Eg.:

  BEGIN;
  INSERT INTO parent_table(key,val) VALUES (10, "hello");
  INSERT INTO child_table(key, parent_key, val) VALUES (1, 10, "xizzy");
  INSERT INTO child_table(key, parent_key, val) VALUES (2, 10, "red");
  INSERT INTO child_table(key, parent_key, val) VALUES (3, 10, "cube");
  COMMIT;

where parent_key is a foreign key referencing "key" in parent_table.


> On the other hand the pulling is volunteer and assuming that the queue
> always has something the worker becomes busy non stop until prepare at
> least (perhaps we could involve it into as well).
> Optimistic version this method would be to leave the prepared modified
> branch to the hands of the leader and switch to the next group's events.

Right.

In fact, this is a limitation of the existing scheduling algorithm that
already has an effect which is seen for multi-source replication and for
replication with multiple GTID domain ids. In this case it is possible for
the SQL driver thread to completely fill up the worker queues with events
from one source / one domain. And the other sources/domains will stall until
the first one catches up. Which is not great. And thus we have the
work-around with slave_domain_parallel_threads which is rather crude and
inflexible.

A more flexible scheduler would also be needed to optimise the thread usage
related to wait_for_prior_commit(). Currently this method blocks the worker
thread, and thus for long lookahead in optimistic parallel replication,
large number of worker threads are needed, most of them idle as you also
pointed out. It might be possible to leave the THD in the group commit queue
and let the thread switch to work on another transaction meanwhile. This is
something I never really looked into, but I think there are similar
mechanisms already existing for thread pool operation. Not sure how much can
be gained nowadays from avoiding large number of idle threads, but this
again would require a better scheduling algorithm.


>> Or is there some mechanism by which two different threads could run S2 and
>> S3 in parallel, but still within a single transactions (inside InnoDB eg?) -
>> I imagine this is not the case.
>
> We are certainly going to that direction. Let me shy off detailing in
> already a day long mail :-), anyway it's
> better off to show a patch that templates an idea.

Right, running a single transaction (or even single statement) on multiple
threads is also something interesting to look at - also outside of
replication of course, but yes, probably best left for another discussion ;-)


 - Kristian.
Follow ups

Re: [Maria-discuss] Pipeline-stype slave parallel applier
From: andrei . elkin, 2018-11-26
References

Re: [Maria-discuss] Pipeline-stype slave parallel applier
From: Kristian Nielsen, 2018-11-19
Re: [Maria-discuss] Pipeline-stype slave parallel applier
From: andrei . elkin, 2018-11-20
Re: [Maria-discuss] Pipeline-stype slave parallel applier
From: Kristian Nielsen, 2018-11-21
Re: [Maria-discuss] Pipeline-stype slave parallel applier
From: andrei . elkin, 2018-11-22