
maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication


On Wed, 17 Mar 2010 10:48:50 +0100, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
>>> Can you give an example of what an "RS History" would be? It was not
>>> 100% clear to me.
>> It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...;
>> INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear
>> about it?
> I think about an RS as for example a set of tables in a set of schemas. And
> of an RS History as for example a binlog.
> So what is not clear to me is how the IDs get assigned to an RS, and to an
> RS History. Is it assigned by the DBA in server configuration? Is it assigned
> automatically by the server? Or assigned somehow by the redundancy service?
> Also, it is not clear to me when the RS ID and RS History ID will be the
> same, and when they will differ.

Ok, that's a good question. The way I see it is:

1) RS ID. Obviously an RS is an object and as such needs some way to refer to
it. Who decides what goes into an RS and that we even need one? - The
redundancy service user. So he creates it and he names it. This ID has to be
unique only to the extent that there'd be no confusion between two RS's should
they meet in the same context. Say, it can be a human-readable name. In
trivial cases, when the RS is just a whole database, like in our wsrep API, it
could be a cluster name. Galera nodes refuse to connect to nodes with a
different cluster name. In the general case, when there can be several RS's
on one physical box (e.g. one for replication and one for binlogging), we'll
have to introduce a concept of a logical cluster - i.e. all processes which
participate in replication of a given RS.

2) RS History ID. Suppose you have a cluster that replicates "RS1", and
currently it is at T12345. A node joins the cluster and says: "I have 'RS1'
at T12344". What we can have here:
 - the node was previously a member of this cluster, went down for
maintenance, meanwhile the cluster committed one more trx, and all that is
left is to send T12345 to this node.
 - the node was never a member of this cluster, and the 'RS1' name stands
there by operator mistake. This node needs a total state snapshot.
 - the node was previously a member of this cluster, lost connectivity to it
at T12000, but went on accepting local queries with hopes for subsequent
conflict resolution (i.e. effectively spawned a new cluster of its own).
There needs to be a conflict resolution phase (hopefully this redundancy
plugin supports it).
 - something else happened.

So in all cases but the first one we need to be able to tell that T12344 is
not really comparable to T12345: the state of that node does not really
correspond to the state of the cluster at T12344, because the two underwent
different sequences of changes. This is where the RS History ID comes in. It
has to be globally unique and is generated by the redundancy service
(because it is the one who knows what's going on), just like the global
transaction sequence number. E.g. in Galera it is a time-based UUID.

So RS ID is just an RS name. RS History ID is a part of a global
transaction ID. There does not have to be 1-to-1 relation between the two.

> So if I have a simple MySQL-style master->slave replication, will I have one
> or two replication sets? If I have one master and two slaves, which of the
> three servers will have the same RS ID, and which will have a different one?

All of them will have the same RS ID as they participate in replication of
the same RS. Moreover, they'll have the same RS History ID, as they obviously
have the same history of RS changes, which is defined single-handedly by the
master.

> Likewise, if I have a two-level MySQL-style replication
> master->slave1->slave2, how many RS Histories will there be, and how many of
> those will have different/same RS History IDs?

Same as above. In this example - one.

>> I don't think that you need 2PC between the redundancy service and the
>> storage engines, because the redundancy service never fails. Well, when it
>> fails, you have something more important to worry about than disk flushes
>> anyways.
> Hm, I don't understand...
> What if the redundancy service is invoked, and it logs (or whatever) the
> changeset. Then the machine crashes before the engine(s) have time to
> commit. When the server comes back up, how do we avoid that the redundancy
> service (and hence the slaves) will have the change, but the master will
> not?
> The ability to implement reliable crash-recovery is something I see
> requested a lot, and I think it is really important.

Well, this is what the redundancy service is for: when you crash, you can
recover, can't you? My understanding is that you do want the redundancy
service to have the change, so that you can take it from there and commit the
uncommitted. This is how Galera works, and I believe this is how semi-sync in
MySQL works (there's hardly any other way for synchronous replication to
work).

In a sense you can see the redundancy service as a glorified transaction log.
> In current MySQL replication, there is 2-phase commit between the storage
> engine and the binlog. I do not see how this can be avoided in the general
> case (without losing the ability to recover after a crash).

I might not be qualified enough to comment on this particular solution as I
don't know that code; perhaps other developers can. AFAIK, the binlog in
MySQL is treated as some kind of a storage engine, and perhaps this is why
2PC is used, but it's a big mistake. The binlog is not a storage engine (yes,
they both write to disk, but that's where the similarities end). It needs a
different interface - exactly what we're discussing right now.

My thinking is that while the engines need to have a guarantee that the
binlog write was successful, the binlog (or any other redundancy service)
does not need the same guarantee from the engines. So instead of a full-blown
2PC you need a simple if/else clause:

if (redundancy_service->pre_commit() == SUCCESS)
   /* commit */
else
   /* rollback */

More on that below.

>> As for the order, the problem is that in general you never know in what
>> order the redundancy service will log/replicate a transaction until you
>> ask it. It is crucial for multi-master replication as you never know what
>> the total order of transactions will be until you replicate and receive
>> them. It is important for (semi)synchronous replication. It also makes
>> sense in regular asynchronous master-slave as you can do replication
>> concurrently with committing. So asking the redundancy service first is
>> good all around.
> Of course, this means that there can only be one redundancy service at a
> time, doesn't it?

It sounds like that in the general case. But realistically, there'd be few
plugins which have to decide on the global transaction ID; the others can be
told. E.g., unlike a multi-master replication plugin, the binlog plugin can
easily log events in the order it is told to. So you can effectively stack
redundancy plugins, putting the one which decides on the global transaction
ID on top. The others just follow the defined order. So you can have the
following stack:

multi-master synchronous plugin (for HA)
master-slave asynchronous plugin (for fanning out)
binlog plugin (for local recovery after crash)

> But true, it is essential that storage engine(s) and redundancy service(s)
> commit in the same order. It is however not yet clear to me that asking the
> redundancy service first is sufficient for this, nor that it is necessary.

As far as I understand, any (semi)synchronous replication requires that.

> So let me write up the steps to commit a transaction, and maybe the picture
> will be clearer to me:
> It seems to me that to ensure the same order in participating engine(s) and
> redundancy service(s), we will need a server-global commit lock, and a call
> into each participant while that lock is taken:
>     lock(commit_order_lock);
>     redundancy_service->txn_fix_order(thd);
>     engine->txn_fix_order(thd);
>     unlock(commit_order_lock);
> The idea is that in this call, each participant will do the minimum amount
> of work to ensure that the transaction `thd' will be the next transaction
> committed, *if* it is committed. Eg. insert the transaction data into the
> transaction log/binlog buffer (but not necessarily write it to disk yet).
> Next, the transaction needs to be prepared in all participants, ie.
> two-phase commit for crash recovery. I think this can be done without a
> global lock:
>     redundancy_service->txn_prepare(thd);
>     engine->txn_prepare(thd);
> In this step, each participant must ensure that the transaction is
> persistent (eg. fflush()). Also note that both this and the previous step
> may fail in either engine, because of some internal error, or because of a
> crash. So each participant must be prepared up to here to roll back the
> transaction, either immediately or during crash recovery after bringing the
> server back up.
> However, once this step is successfully completed for all participants, the
> transaction can no longer be rolled back.
> The group commit facility comes in here. Since this step is outside of the
> global lock commit_order_lock, there could be multiple transactions
> waiting in
> txn_prepare(). And each participant is free to use a single fflush() to
> persist all of them at once. This is crucial for good performance.
> (For maximum flexibility, we might even want this:
>     redundancy_service->txn_prepare_start(thd);
>     engine->txn_prepare_start(thd);
>     redundancy_service->txn_prepare_wait(thd);
>     engine->txn_prepare_wait(thd);
> This would allow both participants to run their fflush() or whatever in
> parallel, which seems to be desirable.)
> (Note that participants would be free to implement all actions in
> txn_fix_order() and have txn_prepare_{start,wait} be empty, at the cost of
> losing group commit functionality. Likewise, txn_prepare_wait() could be
> empty; the API just leaves the option of having the split open.)
> Finally, if all participants prepared their transactions without failure,
> they will be really committed:
>     redundancy_service->txn_commit(thd);
>     engine->txn_commit(thd);
> (I am not sure if we need to split into _start and _wait here. I seem to
> recall that this step also needs to persist to disk (fflush()), but just
> now I
> do not see why it would need it.)
> In the case of crash recovery, the server will ask each participant for the
> list of transactions that had successfully completed the txn_prepare()
> step. For each transaction, if it completed in all participants it will be
> txn_commit()'ed, otherwise it will be rolled back.
> So with this, is there a need to invoke the redundancy service first? If we
> do, it will be possible to pass the global transaction ID on to all the
> subsequent steps. So yes, that seems desirable. Any other reasons?
> Another point: It seems to me that in the general case, it will limit us a
> lot to require that no gaps in the global transaction ID can be introduced.
> I do not see how to make group commit possible in this case (following the
> above sketch).
> So, any comments/objections to the above sketch of the commit part of an
> API?

Hm. I'm not well versed in MySQL engine interfaces, so I might be missing
something here. But why can't we have something like this instead:

   engine->txn_prepare(thd);

   if (redundancy_service->pre_commit(thd) == SUCCESS)
   {
      engine->txn_commit(thd);
      redundancy_service->post_commit(thd);
   }
   else
   {
      engine->txn_rollback(thd);
   }

The pre_commit() call locks commit_order_lock in the right order.
post_commit() releases commit_order_lock and the resources that might be
allocated by the transaction in the redundancy service. Or, if txn_commit()
is too heavy an operation, then maybe like this?:

   engine->txn_prepare(thd);

   if (redundancy_service->pre_commit(thd) == SUCCESS)
   {
      engine->txn_fix_order(thd);
      redundancy_service->post_commit(thd);
      engine->txn_commit(thd);
   }
   else
   {
      engine->txn_rollback(thd);
   }
>> I think "a cluster that outwards presents a consistent transactional view,
>> yet internally does not have a total ordering on transactions" is an
>> internally contradictory concept. Suppose node1 committed T1, but not T2
>> yet, and node2 committed T2, but not T1 yet. Now do database dumps from
>> those nodes. Which one will represent a consistent transactional view?
> I am thinking about shared-nothing storage engines like NDB, where tables
> are partitioned, with partitions being located on different machines.
> If T1 and T2 both involve node1 and node2, they will need to be linearly
> ordered.
> However, if T1 involves only node1, and T2 involves only node2, and there
> is no communication between node1 and node2 while T1 and T2 run, then there
> is no way to distinguish whether T1 or T2 committed first.
> So we can use (engine local) transaction IDs (T1, node1) and (T2, node2).
> And if T1==T2, then that would mean that no outside entity has evidence as
> to whether T1 or T2 was committed first, so we do not actually need a total
> ordering in this case. On the other hand, if such evidence could exist,
> then we will need T1<T2 (or T2<T1).
> So in this case we avoid having to do extra communication across all nodes
> just to synchronise the "next transaction ID" counter if it is not needed.

Oh, yes, this is a corner case which is not easily described by the
suggested model. But I guess it is highly unconventional and does not
really qualify as redundancy or replication in the defined sense. My
understanding is that one model never fits all. If we need something like
NDB, it is better to develop a separate model and API for that, rather than
try to fit everything in one.

> ... But I think you are right that in practice one would impose a total
> order anyway, to get a consistent view. So if T1==T2, one would arbitrarily
> define that T1 came before T2, since node1<node2. And maybe this is in any
> case rather academic.
>  - Kristian.


Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
