maria-developers team mailing list archive

Thread
Date
Re: Ideas for improving MariaDB/MySQL replication

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
Date: Tue, 16 Mar 2010 07:32:53 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87mxy9kelx.fsf@knielsen-hq.org>
Organization: Codership Oy
User-agent: RoundCube Webmail/0.3.1
On Mon, 15 Mar 2010 12:29:14 +0100, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:

>> 2) It correctly identifies that (the part of) ID should be a monotonic
>> ordinal number.
> 
> Ok. But should it increase in order of transaction start? Or in order of
> transaction commit?

I think this is easy: until transaction is committed, it does not exist.
Even with MyISAM a certain time passes from query parsing to actually
modifying table data so that the changes become visible to others. So this
moment, when changes become visible to others defines the order of database
modifications. Transaction start order is hardly of any interest to anybody
except for the very curious.

> Or maybe your point is that it should be monotonic, but that the actual
> order
> of changesets is an implementation detail of the particular replication
> system? I'd really like to hear your thoughts on this.
> 
>> 3) It correctly identifies that the global transaction ID is generated
by
>> redundancy service - in that case "MYSQL_LOG::write(Log_event *)
>> (sql/log.cc, line 1708)"
> 
> Right. So the _concept_ of global transaction ID needs to be part of the
> framework. But the actual _implementation_ of that global transaction ID
> is an
> implementation detail of the redundancy service. Correct?

Point 3) actually answers the questions above, but, perhaps, not in an
obvious manner. There are two things:

- global transaction ID format and comparison operations are defined by
the API. I elaborated on it in the previous e-mail. But the values are
filled by the redundancy service. That's one of its "services".

- commit order should follow the global transaction ID order. So the ORDER
of commit operations is effectively decided by redundancy service. And here
maybe lies the biggest controversy of all.

>> What's bad about it:
> 
>> 2) It fails to address multi-master: (server_id, group_id) is not going
>> to
>> work - such pairs cannot be linearly ordered and, therefore, compared.
>> And
>> from the perspective of the node that needs to apply the changeset -
does
>> server_id really matter? It may be good for debugging, but it can't be
a
>> part of a global transaction ID.
> 
> I don't understand why it would not work for multi-master. Can you
> elaborate?

Suppose you have node1, node2 and node3. Node1 and node2 are acting as
masters. So node1 sends (node1, 11), (node1, 12). Node2 sends (node2, 2),
(node2, 3). Partial order of commits is well established, but node3 still
won't be able to tell in what general order it should apply those
transactions. Does (node1, 11) go before (node2, 3)? why?

> Especially given that you later suggest yourself:
> 
>> One possible implementation for that can be (UUID, long long) pair.
> 
> How is this different from (server_id, group_id)? (I'd like to
understand).

It is different in that UUID it that proposal is a replication sequence
UUID. It is one per cluster. All events generated by cluster nodes have the
same UUID. And it serves to distinguish replication events from different
clusters, explicitly making them incomparable (the same way that
(server_id, group_id) pairs with different server_id's are incomparable).
So that you can compare only transaction IDs with the same UUID.

> My thoughts on multi-master is that there are multiple scenarios,
> depending on
> how tightly coupled the nodes of the clusters are with respect to
> transactions:
> 
> 1. Tightly-coupled cluster, that maintain a cluster-wide serialised
> transaction concept. NDB replication is like this (for epochs, which are
> based
> on NDB global checkpoint IDs AFAIR). In this case, the entire cluster
> effectively is a single server wrt. replication, so it's not really
> multi-master for this discussion.
> 
> 2. One can imagine a cluster that outwards presents a consistent
> transactional
> view, yet internally does not have a total ordering on transactions.
> Transaction T1 only needs to be ordered after T2 if it can be proven (in
> terms
> of communication within the system) that T1 committed before T2. If
there
> is
> no evidence one way or the other to the actual order of T1 and T2 (eg.
they
> ran in parallel against disjoint data), there is no a priori reason they
> should have to be linearly ordered. If on the other hand there is such
> evidence (eg. T2 waited for a lock released by T1 commit), then there
would
> have to be such linear ordering. I think NDB individual transactions are
a
> bit
> like this. So this would make transaction IDs only partially ordered
(but
> any
> arbitrary extension to a total order would be valid).

I think "a cluster that outwards presents a consistent transactional view,
yet internally does not have a total ordering on transactions" is an
internally contradictory concept. Suppose node1 committed T1, but not T2
yet, and node2 committed T2, but not T1 yet. Now do database dumps from
those nodes. Which one will represent a consistent transactional view?

I guess you're trying to make a cause for parallel transaction applying on
slaves here. You can do parallel out of order applying alright. It is just
when you have committed T2 but not T1 yet, the node does not have a T2
state. And it does not have a T0 state. It has an undefined state. Only
after T1 is committed it suddenly becomes T2. There can be several ways to
deal with it. For example, you do apply T2, but you don't commit it. And
after you have applied T1, you commit both of them in one go.

The point is that if you want _replication_, you gotta have a linear
order. Without it you can't compare db states and therefore can't say that
this db is a replica of that db. What you end up with is a so called
"eventual consistency", and "eventual" may never happen.

> 3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous
> replication. In this case there really is no total linear order, as the
> transactions are not coordinated between servers. At most there would be
an
> approximate order like a time stamp, but that would be not 100% reliable
> due
> to time skew and such.

That's what the concept of Replication Set was about below. In the
scenario described above you effectively have multiple replication sets and
multiple logical clusters. So each one of them will have its own global
transaction ID sequence. If they are so much uncoupled there is no point in
considering them a single cluster.

> So I don't understand what issue you had in mind for multi-master
> replication
> here?

An ability to unambiguously identify a database state by a global
transaction ID.
 
>> I'll take this opportunity to put forth some theory behind the global
>> transaction IDs as we see it at Codership.
>>
>> 1. We have an abstract set of data subject to replication/logging. It
can
>> be a whole database, a schema, a table, a row. Lets call it a
Replication
>> Set (RS).
>>
>> 2. RS is undergoing changes in time which can be represented as a
series
>> of atomic changes. Let's call it RS History. That it is a _series_ is
>> trivial but important - otherwise we can't reproduce historical RS
state
>> evolution. Each RS change is represented by a changeset. Since it is a
>> series, RS changesets can be enumerated with a sequence of natural
>> numbers
>> without gaps within a given RS History. Here comes the first component
>> of a
>> global transaction ID: sequence number (seqno).
> 
> Right, the series _can_ be linearly ordered. But does it have to be?
> 
> Suppose we enforce an artificial linear order among some changesets that
> have
> no a priori data dependencies between them (either order of applying
them
> would be correct). It seems to me that this needlessly discards an
> opportunity
> for parallelism. Given that lack of parallelism (on the slave end) is
> arguably
> the biggest performance bottleneck of current MySQL replication, this is
> something to avoid if possible.

I guess I have addressed this above already. In short, parallel applying
does not need to mean out-of-order commit.

> Isn't it enough that changesets are partially ordered, as long as that
> partial
> order encodes all dependencies that need to be respected?
> 
> In other words, is it necessary to enforce such total order at the level
of
> the API, instead of just introducing it at the point where it becomes
> needed,
> and if so, why?

There are at least 2 problems with not having linearly ordered commits:

1) Database states are incomparable. It is impossible (ok, impractically
hard) to tell if two databases have identical contents and determine which
transactions are missing. How will you join a node to a cluster in this
case?

2) Causality preservation. Suppose you commit something on master and now
want to select results from slave. How will you tell that slave already has
committed these results?

I don't exactly understand how you separate "level of the API" and "the
point where is becomes needed", do you mean code separation or making
linear ordering an optional feature of the redundancy service?

In any case, I think we all understand quite well what IS database
replication and what are redundancy guarantees when commits are linearly
ordered. Id like to hear a more detailed concept of redundancy in the
absence of linearly ordered commits. What is redundancy in that case, what
are the guarantees? How recovery will be performed and so on.

>> 3. However there can be more than one RS. Moreover, the same RS can end
>> up
>> in different clusters and undergo different changes. So, to achieve
truly
>> global unambiguity each changeset, in addition to seqno, should be
marked
>> with a RS History ID. Obviously seqnos from different histories are
>> logically incomparable. Therefore RS History ID can be any globally
>> unique
>> identifier, with no need for < or > operations. This is the second
>> component of global transaction ID.
> 
> Can you give an example of what an "RS History" would be? It was not
100%
> clear to me.

It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...;
INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear
about it?

>> What is not so obvious here is that since global transaction ID is
>> generated by logging/replication service, it is that service that
defines
>> the order of commits, not vice versa. As a result transaction should
>> first
>> be passed to that service and only then committed. For one-way
>> master-slave
>> replication the order of operations is not so important. However for
>> multi-master it is crucial. Note that the actual replication/logging
can
>> still happen asynchronously, but replication service must generate
>> transaction ID before it is committed.
> 
> Right. But is this really possible? After all, the commit may fail, so
we
> don't know before if we will actually commit or rollback... we probably
> don't
> even know the actual order of commits in parallel threads of execution
> until
> after actual commit, unless we take a global lock around (global
> transaction
> id, commit); and then it sounds like we get into the issue with broken
> group
> commit again?

Why, it is possible alright. Galera replication works this way. There
always is a point after which commit just can't fail. When it's been
decided that it is committed.

> So we need two-phase commit here between the replication service and the
> storage engines, right? And in this case, it shouldn't matter which
comes
> first, commit or replication service, does it?
> 
> We need to be very careful with the interface here, two-phase commit can
be
> really expensive needing multiple disk flushes for every transaction to
> ensure
> reliable crash recovery. And we really do not want to re-do the mistake
of
> breaking group commit that we had for years after two-phase commit was
> introduced for the MySQL binlog.

I don't think that you need 2PC between redundancy service and the storage
engines, because redundancy service never fails. Well, when it fails, you
have something more important to worry about than disk flushes anyways.

As for the order, the problem is that in general you never know in what
order redundancy service will log/replicate transaction until you ask it.
It is crucial for multi-master replication as you never know what will be
the total order of transactions until you replicate and receive them. It is
important for (semi)synchronous replication. It also makes sense in regular
asynchronous master-slave as you can do replication concurrently with
committing. So asking redundancy service first is good all around.

> (I am also thinking that generating the global transaction ID at commit
> time
> is too late in the general case. The current MySQL replication buffers
> everything until commit time, but I would like to see us support also
> logging
> events during transactions. And this will require an ID generated at
> transaction start. But maybe this is a different ID than "global
> transaction
> ID", which will still be needed to know commit order of course.)

This is exactly the problem we're facing now with support for LOAD DATA
(and other potentially huge transactions). There clearly is the need for
two IDs in that case. But as you have noticed, the semantics of these two
IDs is rather different. Global transaction ID we've been talking about so
far needs to be linearly ordered (according to my understanding at least)
and does not need to have a server_id, while the ID generated at
transaction start does not have to be linearly ordered and SHOULD have a
server_id as part of it (to ensure uniqueness, manageability and simplify
debugging).

Perhaps it is a good time to settle on terminology. Besides 'global
transaction ID' is just prohibitively long for intensive discussions and
variable bames. Does anyone have better suggestions on how to call these
IDs?

>  - Kristian.

Regards,
Alex

-- 
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
Follow ups

Re: Ideas for improving MariaDB/MySQL replication
From: Henrik Ingo, 2010-03-17
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-17
References

Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-01-22
Re: Ideas for improving MariaDB/MySQL replication
From: MARK CALLAGHAN, 2010-01-24
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-01-27
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-15