maria-developers team mailing list archive

Thread
Date
Re: Ideas for improving MariaDB/MySQL replication

To: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 15 Mar 2010 12:29:14 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <3fbdb03dbc66408a51a776f9f62ad871@localhost> (Alex Yurchenko's message of "Wed\, 27 Jan 2010 15\:46\:20 +0200")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)
Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx> writes:

> The global transaction ID is a cornerstone concept of a any replication
> system which aspires to be pluggable, extensible and go beyond basic
> master-slave. It is hardly possible to even start designing the rest of the
> API without first setting on global transaction ID. This is one of the

Some very interesting thoughts, thanks! (and sorry for the delay in continuing
the discussion). Some questions and comments below.

> What's good about
> http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds:
>
> 1) It introduces the concept of atomic database changesets. (Which,
> ironically, it calls "groups" due to dreaded binlog heritage) 

Right. So this is what I think of as "transaction", though it need not be
(eg. NDB replication uses "epoch", which is a collection of multiple
transactions replicated as a whole).

> 2) It correctly identifies that (the part of) ID should be a monotonic
> ordinal number.

Ok. But should it increase in order of transaction start? Or in order of
transaction commit?

Or maybe your point is that it should be monotonic, but that the actual order
of changesets is an implementation detail of the particular replication
system? I'd really like to hear your thoughts on this.

> 3) It correctly identifies that the global transaction ID is generated by
> redundancy service - in that case "MYSQL_LOG::write(Log_event *)
> (sql/log.cc, line 1708)"

Right. So the _concept_ of global transaction ID needs to be part of the
framework. But the actual _implementation_ of that global transaction ID is an
implementation detail of the redundancy service. Correct?

> What's bad about it:

> 2) It fails to address multi-master: (server_id, group_id) is not going to
> work - such pairs cannot be linearly ordered and, therefore, compared. And
> from the perspective of the node that needs to apply the changeset - does
> server_id really matter? It may be good for debugging, but it can't be a
> part of a global transaction ID.

I don't understand why it would not work for multi-master. Can you elaborate?
Especially given that you later suggest yourself:

> One possible implementation for that can be (UUID, long long) pair.

How is this different from (server_id, group_id)? (I'd like to understand).

My thoughts on multi-master is that there are multiple scenarios, depending on
how tightly coupled the nodes of the clusters are with respect to transactions:

1. Tightly-coupled cluster, that maintain a cluster-wide serialised
transaction concept. NDB replication is like this (for epochs, which are based
on NDB global checkpoint IDs AFAIR). In this case, the entire cluster
effectively is a single server wrt. replication, so it's not really
multi-master for this discussion.

2. One can imagine a cluster that outwards presents a consistent transactional
view, yet internally does not have a total ordering on transactions.
Transaction T1 only needs to be ordered after T2 if it can be proven (in terms
of communication within the system) that T1 committed before T2. If there is
no evidence one way or the other to the actual order of T1 and T2 (eg. they
ran in parallel against disjoint data), there is no a priori reason they
should have to be linearly ordered. If on the other hand there is such
evidence (eg. T2 waited for a lock released by T1 commit), then there would
have to be such linear ordering. I think NDB individual transactions are a bit
like this. So this would make transaction IDs only partially ordered (but any
arbitrary extension to a total order would be valid).

3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous
replication. In this case there really is no total linear order, as the
transactions are not coordinated between servers. At most there would be an
approximate order like a time stamp, but that would be not 100% reliable due
to time skew and such.

So I don't understand what issue you had in mind for multi-master replication
here?

> I'll take this opportunity to put forth some theory behind the global
> transaction IDs as we see it at Codership.
>
> 1. We have an abstract set of data subject to replication/logging. It can
> be a whole database, a schema, a table, a row. Lets call it a Replication
> Set (RS).
>
> 2. RS is undergoing changes in time which can be represented as a series
> of atomic changes. Let's call it RS History. That it is a _series_ is
> trivial but important - otherwise we can't reproduce historical RS state
> evolution. Each RS change is represented by a changeset. Since it is a
> series, RS changesets can be enumerated with a sequence of natural numbers
> without gaps within a given RS History. Here comes the first component of a
> global transaction ID: sequence number (seqno).

Right, the series _can_ be linearly ordered. But does it have to be?

Suppose we enforce an artificial linear order among some changesets that have
no a priori data dependencies between them (either order of applying them
would be correct). It seems to me that this needlessly discards an opportunity
for parallelism. Given that lack of parallelism (on the slave end) is arguably
the biggest performance bottleneck of current MySQL replication, this is
something to avoid if possible.

Isn't it enough that changesets are partially ordered, as long as that partial
order encodes all dependencies that need to be respected?

In other words, is it necessary to enforce such total order at the level of
the API, instead of just introducing it at the point where it becomes needed,
and if so, why?

> 3. However there can be more than one RS. Moreover, the same RS can end up
> in different clusters and undergo different changes. So, to achieve truly
> global unambiguity each changeset, in addition to seqno, should be marked
> with a RS History ID. Obviously seqnos from different histories are
> logically incomparable. Therefore RS History ID can be any globally unique
> identifier, with no need for < or > operations. This is the second
> component of global transaction ID.

Can you give an example of what an "RS History" would be? It was not 100%
clear to me.

> What is not so obvious here is that since global transaction ID is
> generated by logging/replication service, it is that service that defines
> the order of commits, not vice versa. As a result transaction should first
> be passed to that service and only then committed. For one-way master-slave
> replication the order of operations is not so important. However for
> multi-master it is crucial. Note that the actual replication/logging can
> still happen asynchronously, but replication service must generate
> transaction ID before it is committed.

Right. But is this really possible? After all, the commit may fail, so we
don't know before if we will actually commit or rollback... we probably don't
even know the actual order of commits in parallel threads of execution until
after actual commit, unless we take a global lock around (global transaction
id, commit); and then it sounds like we get into the issue with broken group
commit again?

So we need two-phase commit here between the replication service and the
storage engines, right? And in this case, it shouldn't matter which comes
first, commit or replication service, does it?

We need to be very careful with the interface here, two-phase commit can be
really expensive needing multiple disk flushes for every transaction to ensure
reliable crash recovery. And we really do not want to re-do the mistake of
breaking group commit that we had for years after two-phase commit was
introduced for the MySQL binlog.

(I am also thinking that generating the global transaction ID at commit time
is too late in the general case. The current MySQL replication buffers
everything until commit time, but I would like to see us support also logging
events during transactions. And this will require an ID generated at
transaction start. But maybe this is a different ID than "global transaction
ID", which will still be needed to know commit order of course.)

 - Kristian.
Follow ups

Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-03-16
References

Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-01-22
Re: Ideas for improving MariaDB/MySQL replication
From: MARK CALLAGHAN, 2010-01-24
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-01-27