maria-developers team mailing list archive

Re: comments on parallel applying


Hi Jonas and Kristian,

The idea of a hybrid approach seems very good.  My experience implementing
parallel apply on Tungsten leads me to believe that masters can supply
useful metadata for replication but cannot supply a definitive plan. There
are a number of reasons for this.

1. Slaves do not always apply all data. This is particularly true if you
are replicating heterogeneously, which we do quite a bit on Tungsten. It's
quite common to drop some fraction of the changes.

2. Slave resources are not fixed and their workload may differ
substantially from the master's.  For instance, both CPU and I/O capacity
vary, especially if host resources are asymmetric.  Workloads are also
asymmetric: you may need to trade off resources devoted to replication
against read-only queries.  In Tungsten we tune the number of threads for
parallel apply as well as load balancing decisions based on these
considerations.

3. Slave side optimizations come into play. Tungsten can permit causally
independent replication streams to diverge substantially--for example you
could allow the slowest and fastest parallel threads to diverge by up to 5
minutes.  Doing so ensures that you continue to get good parallelization
even when workloads have a mix of very large and very small transactions.
The choice of interval depends on factors like how long you are willing to
wait for replication to serialize fully when going offline or how much
memory you have in the OS page cache.
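
The divergence bound in point 3 can be sketched roughly as follows. This is
a minimal illustration in Python, not Tungsten's actual code: the `Channel`
class, the `may_apply` function, and the use of master commit timestamps to
measure progress are all assumptions made for the example.

```python
from dataclasses import dataclass

# The 5-minute figure from the example in the text; tunable per slave.
MAX_DIVERGENCE_SECONDS = 300.0


@dataclass
class Channel:
    """One parallel apply thread (hypothetical name and fields)."""
    name: str
    applied_commit_ts: float = 0.0  # master commit time of last applied txn


def may_apply(channel, next_commit_ts, channels,
              max_divergence=MAX_DIVERGENCE_SECONDS):
    """Return True if `channel` may apply a transaction that committed on
    the master at `next_commit_ts` without getting more than
    `max_divergence` seconds ahead of the slowest other channel."""
    others = [c for c in channels if c is not channel]
    if not others:
        return True  # a single channel has nothing to diverge from
    slowest = min(c.applied_commit_ts for c in others)
    return next_commit_ts - slowest <= max_divergence


channels = [Channel("a", 1000.0), Channel("b", 1200.0)]
print(may_apply(channels[1], 1290.0, channels))  # within the bound
print(may_apply(channels[1], 1301.0, channels))  # too far ahead, must wait
```

A real scheduler would also need to handle idle channels, which would
otherwise hold the others back; this sketch ignores that case.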

MariaDB parallel apply works differently from Tungsten, of course, and you
may permit a different set of trade-offs.  In general, though, it seems that
the most valuable contributions from the master side are the following:

1.) Provide a fully serialized binlog. I cannot begin to say how helpful it
is that MySQL did this a long time ago.

2.) Provide as much metadata as possible about whether succeeding
transactions are causally independent.

3.) Where feasible, limit transactions that would require full
serialization of replication.  For instance, it's very helpful to forbid
transactions from spanning schema boundaries, so you get a series of
guaranteed causally independent streams at the master.

Beyond that it's up to the slave to decide how to use the information when
applying transactions.
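
As a rough sketch of that division of labour, assume the master annotates
each transaction with the single schema it touches, or with nothing if it
spans schemas. The slave then combines that annotation with its own local
settings. Everything named here (`TxnMeta`, `plan`, `ignored_schemas`) is
hypothetical and is not part of MariaDB or Tungsten.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TxnMeta:
    """Master-side annotation for one transaction (hypothetical format)."""
    seqno: int
    schema: Optional[str]  # None: spans schemas or unknown


def plan(txn: TxnMeta, num_workers: int,
         ignored_schemas: frozenset = frozenset()
         ) -> Tuple[str, Optional[int]]:
    """Slave-side decision: returns (action, worker_index)."""
    if txn.schema is None:
        # No independence guarantee from the master: apply serially.
        return ("serialize", None)
    if txn.schema in ignored_schemas:
        # The slave does not apply all data (e.g. filtered replication).
        return ("skip", None)
    # Same schema always maps to the same worker, preserving its order.
    # crc32 is used because it is stable across processes.
    worker = zlib.crc32(txn.schema.encode("utf-8")) % num_workers
    return ("parallel", worker)


for t in [TxnMeta(1, "sales"), TxnMeta(2, "logs"), TxnMeta(3, None)]:
    print(t.seqno, plan(t, num_workers=4,
                        ignored_schemas=frozenset({"logs"})))
```

The number of workers and the set of ignored schemas are local to the slave,
so the same annotated binlog can be applied differently on different hosts.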

Cheers, Robert

On Fri, Jul 4, 2014 at 5:05 AM, Jonas Oreland <jonaso@xxxxxxxxxx> wrote:

> On Fri, Jul 4, 2014 at 10:26 AM, Kristian Nielsen <
> knielsen@xxxxxxxxxxxxxxx> wrote:
>> Jonas Oreland <jonaso@xxxxxxxxxx> writes:
>> > <quick thoughts on implementation>
>> > for row-based replication this seems quite "easy".
>> >
>> > for statement-based replication i imagine that you would have to add hooks
>> > into the "real" code
>> > after parsing has been performed, but before the actual execution is
>> > started (and yes, i know that there is sometimes a blurry line here)
>> > </thoughts>
>> A different approach could be to do this on the master.
>> When a transaction is binlogged, we have easy access to most/all of this
>> information. And there is room in the GTID event at the start of every
>> binlog event group to save this information for the slave. Then the slave
>> has the information immediately when it starts scheduling events for
>> parallel execution. So this does not sound too hard. Though the amount of
>> information that can be provided is then somewhat limited for space and
>> other reasons, of course.
> or perhaps a hybrid approach.
> master does "interesting" annotations
> slave takes decision based on annotations *and* own analysis
> /Jonas
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp