maria-developers team mailing list archive

Re: [Commits] Rev 4376: MDEV-6676: Speculative parallel replication in http://bazaar.launchpad.net/~maria-captains/maria/10.0

 

Hi Kristian,

There is one thing I have never understood about your parallel apply
algorithm.  How do you handle the case where the server crashes when some
threads have committed but others have not?  It seems as if you could have
a problem with recovery.

Cheers, Robert Hodges


On Wed, Sep 10, 2014 at 2:06 AM, Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
wrote:

> Jonas Oreland <jonaso@xxxxxxxxxx> writes:
>
> Hi Jonas, I actually was planning to discuss this with you, as it is based on
> some of the ideas you mentioned earlier on parallel replication...
>
> >>   Intermediate commit. Patch is far from complete, but this small patch
> >>   was nevertheless sufficient to be able to run sysbench-0.4 OLTP with
> >>   full parallelisation.
> >>
> >
> > "full parallelisation" does that mean that X threads on master make slave
> > achieve k*X higher throughput ?
>
> Hm, actually, it's not related to threads on the _master_ at all. Rather, it
> is potentially a throughput of k*Y where Y is the number of worker threads on
> the _slave_, up to some limit of scalability, of course.
>
> Suppose in the binlog we have transactions T1, T2, T3, T4. With this patch,
> we are going to try to replicate _all_ of them in parallel (up to a maximum
> of Y).
>
> If the transactions are non-conflicting, then great, everything will work
> fine and we will still commit them in the correct order, so applications will
> not see any difference.
>
> But suppose e.g. T3 modifies the same row as T1, and T3 manages to touch the
> row first. In this case, T1 will need to wait for T3. This is detected as a
> deadlock (because T3 must in turn wait for T1 to commit first, to preserve
> the commit order). So we roll back T3, allowing T1 to continue, and later
> retry T3.
>
> So it is safe to try to run everything in parallel, at least for
> transactional events that can be safely rolled back.
>
> The only catch seems to be if there are a lot of potential conflicts in the
> application load. Then we could end up with too many rollbacks, causing
> throughput to decrease rather than increase.
>
> The next step is to add some flags to the GTID event on the master, and use
> those flags to control what to run in parallel on the slave:
>
>  - If DDL or non-transactional tables are involved, set a flag to not run
>    this event group in parallel with those that come before or after.
>
>  - Remember on the master if a transaction had to do a lock wait on another
>    transaction; in this case it seems likely that a similar wait could be
>    needed on the slave, so do not start this transaction in parallel with
>    any earlier ones.
>
>  - Maybe we can have a flag for "large" transactions that modify many rows;
>    we could choose not to run those in parallel with earlier transactions,
>    to avoid the need for expensive rollback of lots of rows.
>
>  - Allow the user to set some @@rpl_not_parallel variable, to explicitly
>    annotate transactions that are known to be likely to conflict, and hence
>    not worth trying to run in parallel.
>
> This should be simple to do. Later we could also think about adding checks
> on the slave to further control what to do in parallel, but I have not
> thought much about this yet.
>
> This patch seems to have a lot of potential to finally get a good solution
> to the single-threaded slave problem. But testing against real-life workloads
> will be needed to understand how to balance the speculative parallelisation
> against avoiding excessive rollbacks.
>
>  - Kristian.
>
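
The conflict handling described in the quoted message (start everything in
parallel, commit in binlog order, roll back and retry a later transaction
that blocks an earlier one) can be illustrated with a small single-threaded
simulation. This is only a sketch under simplifying assumptions and is not
taken from the MariaDB patch: the function name apply_speculatively, the
dict-of-rows input, and the start_order argument are all hypothetical, each
transaction takes all of its row locks at once, and start_order is assumed
to list every transaction exactly once.

```python
from collections import deque


def apply_speculatively(rows_by_txn, start_order):
    """Simulate one batch of speculative parallel apply (illustrative only).

    rows_by_txn: {binlog position: set of rows the transaction modifies}
    start_order: order in which workers happen to reach their row locks
    Returns (commit order, list of transactions that were rolled back).
    """
    commit_sequence = sorted(rows_by_txn)   # commits must follow binlog order
    lock_owner = {}                         # row -> transaction holding it
    ready = set()                           # applied, waiting for their turn
    committed, rollbacks = [], []
    queue = deque(start_order)

    def release(seq):
        # Drop every row lock held by `seq` (on commit or on rollback).
        for row in rows_by_txn[seq]:
            if lock_owner.get(row) == seq:
                del lock_owner[row]
        ready.discard(seq)

    while len(committed) < len(commit_sequence):
        seq = queue.popleft()
        owners = {lock_owner[r] for r in rows_by_txn[seq] if r in lock_owner}
        if any(owner < seq for owner in owners):
            # Ordinary lock wait on an earlier transaction: try again later.
            queue.append(seq)
        else:
            # Any owner is a later transaction that must itself wait for
            # `seq` to commit: a deadlock, resolved by rolling it back.
            for victim in sorted(owners):
                release(victim)
                rollbacks.append(victim)
                queue.append(victim)        # retried later
            for row in rows_by_txn[seq]:
                lock_owner[row] = seq
            ready.add(seq)
        # Commit strictly in binlog order, as far as possible.
        while (len(committed) < len(commit_sequence)
               and commit_sequence[len(committed)] in ready):
            nxt = commit_sequence[len(committed)]
            release(nxt)
            committed.append(nxt)
    return committed, rollbacks


# T3 touches row "a" before T1 does, as in the example in the message.
batch = {1: {"a"}, 2: {"b"}, 3: {"a"}, 4: {"c"}}
print(apply_speculatively(batch, [3, 1, 2, 4]))   # ([1, 2, 3, 4], [3])
```

In this toy model a non-conflicting batch commits with an empty rollback
list, while a workload with many conflicts produces a growing one, which is
the trade-off the message says needs testing against real-life workloads.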
