
maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication


Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx> writes:

> On Mon, 15 Mar 2010 12:29:14 +0100, Kristian Nielsen
> <knielsen@xxxxxxxxxxxxxxx> wrote:

>>> One possible implementation for that can be (UUID, long long) pair.
>> How is this different from (server_id, group_id)? (I'd like to
>> understand).
> It is different in that the UUID in that proposal is a replication sequence
> UUID. It is one per cluster. All events generated by cluster nodes have the
> same UUID. And it serves to distinguish replication events from different
> clusters, explicitly making them incomparable (the same way that
> (server_id, group_id) pairs with different server_id's are incomparable).
> So that you can compare only transaction IDs with the same UUID.

Ok, thanks for the clarification, I think I see now. "server_id" does not
work, as there can be multiple different server_ids in a single cluster. We
need every server in the same cluster (NDB, Galera, ...) to use the same UUID
in the global transaction ID.
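
(As an illustration, comparison of two such IDs might look like the
following sketch, where `uuid' is the per-cluster replication sequence
UUID and `seqno' the transaction sequence number; the names are mine:)

  compare(id1, id2):
    if (id1.uuid != id2.uuid)
      return INCOMPARABLE;    /* different clusters / histories */
    if (id1.seqno == id2.seqno)
      return EQUAL;
    return (id1.seqno < id2.seqno) ? BEFORE : AFTER;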

> I guess you're trying to make a case for parallel transaction applying on
> slaves here. You can do parallel out of order applying alright. It is just
> when you have committed T2 but not T1 yet, the node does not have a T2
> state. And it does not have a T0 state. It has an undefined state. Only
> after T1 is committed it suddenly becomes T2. There can be several ways to
> deal with it. For example, you do apply T2, but you don't commit it. And
> after you have applied T1, you commit both of them in one go.

Agree. Even with parallel transaction application, one would want to commit in
some fixed (from the origin) order.

(Following above example, even if there is no evidence on master as to whether
T1 or T2 came first, we will want the order to be consistent among multiple
slaves, so that there is no possibility to see T1-without-T2 on one slave and
T2-without-T1 on another).

> There are at least 2 problems with not having linearly ordered commits:
> 1) Database states are incomparable. It is impossible (ok, impractically
> hard) to tell if two databases have identical contents and determine which
> transactions are missing. How will you join a node to a cluster in this
> case?

Yes. The issue of joining a cluster is also a good point.

After reading your comments, I tend to agree that global transaction IDs
should be totally ordered.

> I don't exactly understand how you separate "level of the API" and "the
> point where it becomes needed", do you mean code separation or making
> linear ordering an optional feature of the redundancy service?

(I meant that the comparison provided by the API could be a partial order, and
if it returned T1 and T2 as "equivalent", any plugin that needs a total order
could just pick T1 or T2 arbitrarily as the first. But I agree now that even
if the choice is arbitrary, it should be enforced in the API for consistency.)

>>> 3. However there can be more than one RS. Moreover, the same RS can
>>> end up in different clusters and undergo different changes. So, to
>>> achieve truly global unambiguity each changeset, in addition to seqno,
>>> should be marked with a RS History ID. Obviously seqnos from different
>>> histories are logically incomparable. Therefore RS History ID can be
>>> any globally unique identifier, with no need for < or > operations.
>>> This is the second component of global transaction ID.
>> Can you give an example of what an "RS History" would be? It was not
>> 100% clear to me.
> It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...;
> INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear
> about it?

I think of an RS as, for example, a set of tables in a set of schemas, and
of an RS History as, for example, a binlog.

So what is not clear to me is how the IDs get assigned to an RS, and to an RS
History. Is it assigned by the DBA in server configuration? Is it assigned
automatically by the server? Or assigned somehow by the redundancy service?

Also, it is not clear to me when the RS ID and RS History ID will differ.

So if I have a simple MySQL-style master->slave replication, will I have one
or two replication sets? If I have one master and two slaves, which of the
three servers will have the same RS ID, and which will have a different one?

Likewise, if I have a two-level MySQL-style replication
master->slave1->slave2, how many RS Histories will there be, and how many of
those will have different/same RS History IDs?

>>> What is not so obvious here is that since global transaction ID is
>>> generated by logging/replication service, it is that service that
>>> defines the order of commits, not vice versa. As a result transaction
>>> should first be passed to that service and only then committed. For
>>> one-way master-slave replication the order of operations is not so
>>> important. However for multi-master it is crucial. Note that the
>>> actual replication/logging can still happen asynchronously, but
>>> replication service must generate transaction ID before it is
>>> committed.

> I don't think that you need 2PC between redundancy service and the storage
> engines, because redundancy service never fails. Well, when it fails, you
> have something more important to worry about than disk flushes anyways.

Hm, I don't understand...

What if the redundancy service is invoked, it logs (or whatever) the
changeset. Then the machine crashes before the engine(s) have time to
commit. When the server comes back up, how do we avoid that the redundancy
service (and hence the slaves) will have the change, but the master will not?

The ability to implement reliable crash-recovery is something I see requested
a lot, and I think it is really important.

In current MySQL replication, there is 2-phase commit between storage engine
and binlog. I do not see how this can be avoided in the general case (without
losing the ability to recover after a crash).

> As for the order, the problem is that in general you never know in what
> order redundancy service will log/replicate transaction until you ask it.
> It is crucial for multi-master replication as you never know what will be
> the total order of transactions until you replicate and receive them. It is
> important for (semi)synchronous replication. It also makes sense in regular
> asynchronous master-slave as you can do replication concurrently with
> committing. So asking redundancy service first is good all around.

Of course, this means that there can only be one redundancy service plugin at
a time, doesn't it?

But true, it is essential that storage engine(s) and redundancy service(s)
commit in the same order. It is however not yet clear to me that asking
redundancy service first is sufficient for this, nor that it is necessary.

So let me write up the steps to commit a transaction, and maybe the answer
will be clearer to me:

It seems to me that to ensure same order in participating engine(s) and
redundancy service(s), we will need a server-global commit lock, and a call
into each participant while that lock is taken:
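
(Sketch only; `participant' here means each storage engine and redundancy
service involved in the transaction, and error handling is omitted:)

  lock(commit_order_lock);
  for (each participant in transaction thd)
    participant->txn_fix_order(thd);   /* reserve the next commit slot */
  unlock(commit_order_lock);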


The idea is that in this call, each participant will do the minimum amount of
work to ensure that the transaction `thd' will be the next transaction
committed, *if* it is committed. Eg. insert the transaction data into the
transaction log/binlog buffer (but not necessarily write it to disk yet).

Next, the transaction needs to be prepared in all participants, ie. two-phase
commit for crash recovery. I think this can be done without a global lock:
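
(Sketch; each participant is again a storage engine or redundancy service
involved in the transaction, error handling omitted:)

  for (each participant in transaction thd)
    participant->txn_prepare(thd);   /* make the transaction durable */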


In this step, each participant must ensure that the transaction is persistent
(eg. fsync()). Also note that both this and the previous step may fail in
either engine, because of some internal error, or because of a crash. Thus
each participant must be prepared up to here to roll back the transaction,
either immediately or during crash recovery after bringing the server back up.

However, once this step is successfully completed for all participants, the
transaction can no longer be rolled back.

The group commit facility comes in here. Since this step is outside of the
global lock commit_order_lock, there could be multiple transactions waiting in
txn_prepare(). And each participant is free to use a single fsync() to
persist all of them at once. This is crucial for good performance.

(For maximum flexibility, we might even want this:
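
(Something like this, splitting prepare into an initiate part and a wait
part:)

  for (each participant in transaction thd)
    participant->txn_prepare_start(thd);   /* initiate the flush */
  for (each participant in transaction thd)
    participant->txn_prepare_wait(thd);    /* wait for it to complete */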



This would allow both participants to run their fsync() or whatever in
parallel, which seems to be desirable.)

(Note that participants would be free to implement all actions in
txn_fix_order() and have txn_prepare_{start,wait} be empty, at the cost of
losing group commit functionality. Likewise, txn_prepare_wait() could be
empty, just the API leaves the option of having the split open.)

Finally, if all participants prepared their transactions without problems,
they will be really committed:
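
(Sketch, error handling omitted:)

  for (each participant in transaction thd)
    participant->txn_commit(thd);   /* must not fail at this point */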


(I am not sure if we need to split into _start and _wait here. I seem to
recall that this step also needs to persist to disk (fsync()), but just now I
do not see why it would need it).

In the case of crash recovery, the server will ask each participant for the
list of transactions that had successfully completed the txn_prepare_wait()
step. For each transaction, if it completed in all participants it will be
txn_commit()'ed, otherwise it will be rolled back.
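
(In pseudocode, recovery might look like the following sketch;
txn_rollback() is a name I am making up here for the rollback call:)

  for (each trx in union of participants' prepared transaction lists)
    if (trx completed txn_prepare_wait() in every participant)
      txn_commit(trx);     /* was durably prepared everywhere */
    else
      txn_rollback(trx);   /* some participant never prepared it */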

So with this, is there a need to invoke the redundancy service first? If we
do, it will be possible to pass the global transaction ID on to all the
subsequent steps. So yes, that seems desirable. Any other reasons?

Another point: it seems to me that requiring the global transaction ID
sequence to be free of gaps will be limiting in the general case. I do not
see how to make group commit possible under such a requirement (following
the above steps)?

So, any comments/objections to the above sketch of the commit part of an API?

>> (I am also thinking that generating the global transaction ID at commit
>> time
>> is too late in the general case. The current MySQL replication buffers
>> everything until commit time, but I would like to see us support also
>> logging
>> events during transactions. And this will require an ID generated at
>> transaction start. But maybe this is a different ID than "global
>> transaction
>> ID", which will still be needed to know commit order of course.)
> This is exactly the problem we're facing now with support for LOAD DATA
> (and other potentially huge transactions). There clearly is the need for
> two IDs in that case. But as you have noticed, the semantics of these two
> IDs is rather different. Global transaction ID we've been talking about so
> far needs to be linearly ordered (according to my understanding at least)
> and does not need to have a server_id, while the ID generated at
> transaction start does not have to be linearly ordered and SHOULD have a
> server_id as part of it (to ensure uniqueness, manageability and simplify
> debugging).

Yes, I agree, these are different. Basically, at transaction start one will
probably make a local transaction ID, and tag all events with this ID to know
they are from the same transaction (changeset). But for the final commit
event, one needs to assign the global transaction ID. The local IDs just need
to be unique, the global ID needs to be totally ordered.
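
(Schematically, with purely illustrative names, the event stream for one
transaction might then look like:)

  BEGIN   (local_id)                  /* local_id = (server_id, counter) */
  event   (local_id, data ...)
  event   (local_id, data ...)
  COMMIT  (local_id, global_trx_id)   /* global ID assigned at commit */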

I think I was confusing these two a bit previously.

> I think "a cluster that outwards presents a consistent transactional view,
> yet internally does not have a total ordering on transactions" is an
> internally contradictory concept. Suppose node1 committed T1, but not T2
> yet, and node2 committed T2, but not T1 yet. Now do database dumps from
> those nodes. Which one will represent a consistent transactional view?

I am thinking about shared-nothing storage engines like NDB, where tables are
partitioned, with partitions being located on different machines.

If T1 and T2 both involve node1 and node2, they will need to be linearly
ordered.

However, if T1 involves only node1, and T2 involves only node2, and there is
no communication between node1 and node2 while T1 and T2 runs, then there is
no way to distinguish whether T1 or T2 committed first.

So we can use (engine local) transaction IDs (T1, node1) and (T2, node2). And
if T1==T2, then that would mean that no outside entity has evidence as to
whether T1 or T2 was committed first, so we do not actually need a defined
ordering in this case. On the other hand, if such evidence could exist, then
we will need T1<T2 (or T2<T1).

So in this case we avoid having to do extra communication across all nodes
just to synchronise the "next transaction ID" counter when it is not needed.

... But I think you are right that in practice one would impose a total order
anyway, to get a consistent view. So if T1==T2, one would arbitrarily define
that T1 came before T2, since node1<node2. And maybe this is in any case
rather academic.
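
(I.e. with (Tn, noden) pairs, something like:)

  order_before((t1, n1), (t2, n2)):
    if (t1 != t2)
      return t1 < t2;   /* real ordering evidence may exist */
    return n1 < n2;     /* arbitrary, but consistent across all slaves */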

 - Kristian.
