maria-developers team mailing list archive

Thread
Date
Re: Ideas for improving MariaDB/MySQL replication

To: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
From: Henrik Ingo <henrik.ingo@xxxxxxxxxxxxx>
Date: Wed, 17 Mar 2010 13:03:02 +0200
Cc: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>, maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <b2f15ab93d3be6464a2bd5fbb586ca88@localhost>
Reply-to: henrik.ingo@xxxxxxxxxxxxx
Sender: henrik.ingo@xxxxxxxxx
On Tue, Mar 16, 2010 at 7:32 AM, Alex Yurchenko
<alexey.yurchenko@xxxxxxxxxxxxx> wrote:
> I think "a cluster that outwards presents a consistent transactional view,
> yet internally does not have a total ordering on transactions" is an
> internally contradictory concept. Suppose node1 committed T1, but not T2
> yet, and node2 committed T2, but not T1 yet. Now do database dumps from
> those nodes. Which one will represent a consistent transactional view?

I'm not sure if this is the same as Kristian is thinking about, but I
don't see it as internally contradictory. See below...

> I guess you're trying to make a cause for parallel transaction applying on
> slaves here. You can do parallel out of order applying alright. It is just

This is relevant to being able to do parallel slave execution, but not
restricted to that.

Actually, what Kristian is describing can even happen on a
single-node, multi-threaded database. It is not strictly an issue with
clusters at all. I'll borrow my old example from
http://forge.mysql.com/worklog/task.php?id=4648

We have 2 tables t1 and t2, and 4 transactions applied in a sequence.
Assume that all 4 transactions are done "at the same time" by 4
different threads. The sequence randomly ends up being:

     /* Assume all columns are "0" to start */
     /* 0 */ (SET AUTOCOMMIT=1)
     /* 1 */ UPDATE t1 SET f1=1, f2=2, timestamp=001 WHERE id=1;
     /* 2 */ UPDATE t2 SET f1=3, f2=4, timestamp=002 WHERE id=1;
     /* 3 */ UPDATE t1 SET f1=5, f2=6, timestamp=003 WHERE id=1;
     /* 4 */ UPDATE t2 SET f1=7, f2=8, timestamp=004 WHERE id=1;

Now, depending on how the database internally commits transactions and
writes a transaction log, it may impose a total ordering on these
transactions or not. If you now replay these transactions (such as
apply them on a slave), what happens if you do:

     /* 1 */ UPDATE t1 SET f1=1, f2=2, timestamp=001 WHERE id=1;
     /* 3 */ UPDATE t1 SET f1=5, f2=6, timestamp=003 WHERE id=1;
     /* 4 */ UPDATE t2 SET f1=7, f2=8, timestamp=004 WHERE id=1;

     /* The following statement is discarded since NEW.timestamp <
     OLD.timestamp */

     /* 2 */ UPDATE t2 SET f1=3, f2=4, timestamp=002 WHERE id=1

Depending on what you want to achieve, skipping transaction #2 is ok
or bad. The (slave) state between #3 and #4 is now something that
never existed in the original (master) database. However, taken the
assumption that transactions were independently executed "at the same
time", the ordering was determined by chance, so it is a state that
*could have existed* on the original database. Whether it actually
could have existed, depends on our assumptions of what the application
is doing. If transactions 1-4 are really independent, it could have
existed. If they are interrelated - such as tables t1 and t2 depending
on each other - the application could never actually produce such a
state.

I agree with Alex that if a 100% total ordering is not imposed on the
transactions, it is not correct to say that the second replay (the
slave) is a *replica* of the first run (the master), since it can end
up in states that never existed on the master. However the original
sentence a cluster that outwards presents a consistent transactional
view, yet internally does not have a total ordering on transactions"
is not contradictory in itself, it is something that can happen even
on a single node database.

The argument in favor of "allowing" above like behavior is that we
should be allowed to assume that an application always commits
transactions that produce consistent end states. So if some "nearby"
transactions are then replicated in a different order, each state is
still meaningful to the application, even if some states may not have
actually existed on the master if you look at the database as a whole.

As an example, a bank has a database that reflects how much money we
have in our accounts. If I transfer 100€ to you, then the withdrawal
and deposit are done in the same transaction and the database state is
consitent at the end of the transaction. But if I give 100€ to charity
and you win 100€ on the lottery, these are independent transactions.
For practical purposes no application should care which way these
transactions were committed. Note that also other properties, for
instance the total amount of money the bank has,  would still always
be correct regardless of which way those 2 individual transactions
were committed. So if we now replicate the bank database, it is not in
practice important to replicate the transactions exactly in the same
sequence as they happened to execute on the master database. It is
only in theory that it is a problem that the slave may have states
(when looking at the database as a whole) that didn't actually exist
on the master and therefore are not replicas.

So personally I see that in theory any replication should of course
exactly produce the same states of the "original" database, in
practice if you can get significant performance gains by not imposing
a total ordering between each and every transactions, it is a
practical approach that should be allowed. (At least by an option,
since ultimately it is the app developer or DBA that can decide if
such replication is "safe" or not.)

> The point is that if you want _replication_, you gotta have a linear
> order. Without it you can't compare db states and therefore can't say that
> this db is a replica of that db. What you end up with is a so called
> "eventual consistency", and "eventual" may never happen.

Theoretically speaking, yes.

>> 3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous
>> replication. In this case there really is no total linear order, as the
>> transactions are not coordinated between servers. At most there would be
> an
>> approximate order like a time stamp, but that would be not 100% reliable
>> due
>> to time skew and such.
>
> That's what the concept of Replication Set was about below. In the
> scenario described above you effectively have multiple replication sets and
> multiple logical clusters. So each one of them will have its own global
> transaction ID sequence. If they are so much uncoupled there is no point in
> considering them a single cluster.

What you say above is significant. At least for myself, when speaking
of "replication" I think of solving both the problem of running a
cluster (as you define as tightly coupled above) and replication
between "uncoupled" databases (replication sets), such as 2
datacenters across 2 continents.

Trying to force a strict total ordering across 2 datacenters seemed
like a bad idea, now I understand you never intended to do that.


> I guess I have addressed this above already. In short, parallel applying
> does not need to mean out-of-order commit.

You still end up with some form of group commit.

Continuing from above, there are also applications that do
multi-master replication but the masters are "uncoupled" out of
necessity. An example is when you want to have geographical redundancy
and allow writes to all datacenters. As we have agreed above, it is
not practical to force a total ordering globally across the changesets
in each datacenter, but they still replicate to each other. (Using
conflict resolution such as based on a timestamp, or trying to
guarantee that one datacenter is always the single master for a
certain subset of records.) Apparently our mobile phone networks work
this way (the HLR).

In such a setup, you can do parallel applying all you want, since
there is no "one true" ordering of the transactions anyway.

>> In other words, is it necessary to enforce such total order at the level
> of
>> the API, instead of just introducing it at the point where it becomes
>> needed,
>> and if so, why?
>
> There are at least 2 problems with not having linearly ordered commits:
>
> 1) Database states are incomparable. It is impossible (ok, impractically
> hard) to tell if two databases have identical contents and determine which
> transactions are missing. How will you join a node to a cluster in this
> case?
>
> 2) Causality preservation. Suppose you commit something on master and now
> want to select results from slave. How will you tell that slave already has
> committed these results?
>
> I don't exactly understand how you separate "level of the API" and "the
> point where is becomes needed", do you mean code separation or making
> linear ordering an optional feature of the redundancy service?
>
> In any case, I think we all understand quite well what IS database
> replication and what are redundancy guarantees when commits are linearly
> ordered. Id like to hear a more detailed concept of redundancy in the
> absence of linearly ordered commits. What is redundancy in that case, what
> are the guarantees? How recovery will be performed and so on.

I think you yourself also admitted that in some cases the "replicas"
cannot be so tightly coupled that a globally consistent linear
ordering would be possible/desirable. It is true that some operations
like joining a new node become simple and convenient when you have a
linear ordering to refer to. But there are situations when you don't
have that, for some reason, the most obvious use case being
geographical replication.


> I don't think that you need 2PC between redundancy service and the storage
> engines, because redundancy service never fails. Well, when it fails, you
> have something more important to worry about than disk flushes anyways.

How does synchronous replication happen without 2PC?

henrik (choosing to take part in something interesting rather than
doing my real work :-)

-- 
email: henrik.ingo@xxxxxxxxxxxxx
tel:   +358-40-5697354
www:   www.avoinelama.fi/~hingo
book:  www.openlife.cc
Follow ups

Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-03-17
References

Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-01-22
Re: Ideas for improving MariaDB/MySQL replication
From: MARK CALLAGHAN, 2010-01-24
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-01-27
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-15
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-03-16