maria-developers team mailing list archive

Thread
Date
Thoughts on Global Transaction ID, parallel slave, and multi-source replication

To: maria-developers@xxxxxxxxxxxxxxxxxxx
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Thu, 09 Aug 2012 11:34:44 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
This is a writeup of some thoughts I have on getting Global Transaction ID in
MariaDB, as well as some related thoughts on parallel slave and multi-source
replication.

MySQL 5.6 has global transaction ID now, and I looked into it in some
details. Unfortunately, I _really_ did not like what I saw. There are two main
problems, in my opinion:

1. The basic design is bad, leading to an excessively complicated
implementation.

2. The implementation looks very incomplete/immature, to the point that I am
amazed that anyone would push code like this into a main tree.

Point 2 means that I really do not want to maintain this piece of **** in
MariaDB, though we can hope that things will improve in subsequent
releases. Point 1 leads me to suggest that instead of merging the MySQL 5.6
implementation, we could do our own implementation, using a design that is
both much simpler, and at the same time more flexible that what is in MySQL
5.6. I will elaborate further on these points in the rest of the mail.

----

With respect to point 2, let me quote this little gem from the MySQL 5.6 code:

    /*
    Apparently this code is not being called. We need to
    investigate if this is a bug or this code is not
    necessary. /Alfranio
    */
    DBUG_ASSERT(0); /NOTREACHED/

Now, certainly, there are dark corners of the MySQL replication code, and
adding a comment/assert like this could be justified if a developer encounters
it and finds it may be dead code.

But this is _new_ code, added in a new 5.6 release! So the authors of the new
feature, by their own statement, have no idea what the new code they are
pushing into the main tree is actually doing. That is just insane. I have no
idea what the replication team at MySQL is up to these days pulling stunts
like this, I have to confess I am disappointed.

I found many similar examples in the 5.6 code. I wrote up a more detailed
analysis in TODO-171, where I elaborate more on the problems I see with the
MySQL 5.6 Global Transaction ID code:

    https://mariadb.atlassian.net/browse/TODO-171

When the authors do not understand the code themselves, it really does not
seem attractive for us to try to support and bugfix it. I also found a lot of
what I think are left-over, unsused parts of an even more complicated design
that was apparently abandoned (or postponed?), but left in the code, which
does not make supporting this any easier.

----

To explain point 1, I first need to mention the MySQL 5.6 feature of parallel
slave apply, as I believe that a major factor in the design of MySQL Global
Transaction ID has been the need to interoperate with that feature.

The MySQL parallel replication slave feature works so that transactions on
tables in different databases can be run in parallel with each other
(incidentally, there is some indication that the MySQL replication team
considers this *the* meaning of parallel slave, which is sad, as it is just
one piece of the general problem of replication slave scalability - it helps
in some workloads, but it is useless for others).

So with parallel slave, transactions on the master can be executed in a
different order on the slave. This means that the crucial notion of a slaves
current position in the replication stream becomes more complicated - it is
not just a single point in the master binlog, rather every slave replication
thread can be at a different point in the master binlog.

MySQL 5.6 solves this by introducing "Gtid sets". A Gtid is a MySQL 5.6 global
transaction ID - a unique, 24-byte identifier for every transaction. The state
of a replication slave at any given moment is then

    the set Gtids of all transactions ever executed on that slave server

This notion of having a set of all transactions ever executed really
complicates the design. There is a need to represent such a set as basically
intervals of "holes" in the Gtid sequence, to avoid the memory representation
of such sets growing without bound over time. This in turns introduces the
requirement that there can _never_ be holes in the sequence of Gtids generated
on a master - if there ever is introduced a hole, slaves will wait around for
ever waiting for a transaction containing the missing Gtid. I think recovering
from errors with such a concept will be forever a nightmare.

I think this basic fundament of the design is the main source of my dislike
for the MySQL global transaction ID. It sounds nice on a first look, but once
you start to think deeper on the problem it just does not seem to work well.

----

So I think I have a better idea of how to make Global Transaction ID work well
with parallel slave based on executing transactions in different databases
out-of-order.

The MySQL Gtid is defined as a pair of server-id and sequence number. Every
master server has a single server-id (despite no less than 128 bits being
allocated for this - anyone planning a replication hierarchy with 2**128
servers?). The master generates a single stream of events in the binlog,
though streams from other servers may be interleaved with it in case of
multi-master.

It is then the task of a slave to try to split up the single master stream in
to multiple streams that can be executed in parallel. I think this is the root
of the problem. The slave is the wrong place to do this splitting! It is on
the master that we know how to interpret transactions (ie. that transactions
in different databases are / are not independent), so doing this on the slave
is harder. And two slaves can in general split differently, which makes it
harder for a slave deeper in the replication hierarchy to switch to a new
master (that master being itself a slave of a higher-up master), leading to
the need for a complex set-of-all-executed-transactions concept.

Suppose we instead make the decision on the _master_ on how to split up the
replication stream for parallel apply. Instead of a single server-id for the
master, we can have multiple - let us call them "Source ID".

On the _master_, we will decide that two transactions in different databases
are independent, and assign them different Source ID. The slave then receives
multiple replication streams, and is free to execute each stream in parallel -
or not, at the discretion of the DBA. Transactions _within_ a single stream
are always executed in-order, so the slave has no need to remember any sets of
all transactions ever executed - all it needs to remember is the sequence
number within each stream it has seen - ie. last sequence number for each
Source ID. This is a simple and obvious generalisation of the old concept of
current binlog position. Downstream slaves see the exact same binlog
sequences, each with transactions in the same order (though sequences may be
interleaved differently) - so changing masters is simpler, all servers in the
hierarchy have the same view.

So to implement the same functionality as MySQL 5.6 parallel replication, all
we need is on the master to generate Source ID as a hash of the database name
(just as MySQL 5.6 assigns transactions to slave threads based on such hash of
the database name). And implement a facility on the slave such that
transactions with different Source ID can be executed in parallel by multiple
slave threads.

And now, since we decide on the master, we can provide much more flexibility
for the application. We can have _some_ databases db1,db2 be executed in
parallel on the slave - while other databases db3,db4 are considered
dependent, and not modified out-of-order on the slave. Or the application can
declare that two transactions are independent within the same database or even
within the same table; for example log record inserts into a table can often
be allowed to happen in different order on slaves. Or the DBA can assign a
special Source ID to a long-runnign ALTER TABLE, and such ALTER TABLE can then
execute independently on slaves to not stall the normal load.

All we need is to provide a simple facility (eg. session variable) that
applications/users can use to specify Source ID.

And then the next step is multi-source replication, which is the much-request
feature that one slave can replicate from two or more different masters at the
same time. The different masters will just supply different Source IDs, so
apart from the need to maintain multiple TCP connections, one to each master,
from the rest of the code's point of view multi-source replication is no
different, it is just more independent replication streams.

So my suggestion is: rather than struggle to try and merge and support the
MySQL 5.6 Global Transaction ID and parallel slave, let us instead implement
something like the above idea, with multiple Source IDs on the master, and
parallel execution of distinct replication streams on the slave. I believe
this will be no more work, and the end result will be both simpler and more
flexible.

Thanks,

 - Kristian.
Follow ups

Re: Thoughts on Global Transaction ID, parallel slave, and multi-source replication
From: Stewart Smith, 2012-08-10
Re: Thoughts on Global Transaction ID, parallel slave, and multi-source replication
From: Sergei Petrunia, 2012-08-09