
maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication


Alex, I think this discussion is getting really interesting, and I understand
your points much better now, thanks for your many comments!

Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx> writes:

> On Wed, 17 Mar 2010 10:48:50 +0100, Kristian Nielsen
> <knielsen@xxxxxxxxxxxxxxx> wrote:

>> So what is not clear to me is how the IDs get assigned to an RS, and to an
>> RS History. Is it assigned by the DBA in server configuration? Is it
>> assigned automatically by the server? Or assigned somehow by the redundancy
>> service?

> Ok, that's a good question. The way I see it is


> underwent different sequences of changes. So here comes RS History ID. It
> has to be globally unique and is generated by the redundancy service
> (because it is the one who knows what's going on), just like global
> transaction sequence number. E.g. in Galera it is a time-based UUID.

Ok, so I think this means that the redundancy service will generate a new RS
History ID when it starts a new RS History. Exactly when and how this happens
will then depend on the particular plugin.

>> What if the redundancy service is invoked, it logs (or whatever) the
>> changeset. Then the machine crashes before the engine(s) have time to
>> commit. When the server comes back up, how do we avoid that the redundancy
>> service (and hence the slaves) will have the change, but the master will
>> not?
>> The ability to implement reliable crash-recovery is something I see
>> requested a lot, and I think it is really important.
> Well, this is what redundancy service is for: when you crash, you can
> recover, isn't it? My understanding is that you do want the redundancy
> service to have the change and that you can take it from there and commit
> uncommitted. This is how Galera works and I believe this is how semi-sync
> in MySQL works (there's hardly any other way for synchronous replication to
> work)

Aha, now I see what you mean!

So the redundancy service would provide the durability, using its transaction
log or whatever. This is quite different from current MySQL, where every
engine, as well as the binlog code, is individually responsible for
durability, and then consequently the 2-phase commit is needed to ensure that
everyone has the same idea of what should be durable.

So this is actually quite an appealing idea! With this system, each storage
engine (and secondary replication plugins like binlog) would no longer need to
ensure that transactions are durable, just that they are able to recover after
a crash into a consistent state.

This could translate concretely into (for example) setting
innodb_flush_log_at_trx_commit=0 without risking losing transactions. So no
fflush() per commit in the engine (and I think therefore no need for group
commit in the engine).

So only the (primary) redundancy service would need to fflush() (or whatever)
to achieve durability.

So this is quite interesting! The whole issue of coordinating the state
between engines, master binlog thread, and slave threads, in the case of crash
recovery, is quite a mess in current MySQL, and this indeed sounds much
cleaner.

So let me list some potential issues with this that spring to my mind, and see
how they could be resolved:

1. On crash recovery, the engines must be able to tell where in the RS History
they have recovered themselves to, so that they can correctly be supplied with
the remaining transactions that they must apply.

The best way to do this is probably to extend the storage engine API (and the
engines) with a facility for this, so that the engines can log it in their
transaction log. However, it seems simple enough to emulate it for engines
that do not implement such extensions. It should be sufficient to create a
hidden table inside the engine with a single "transaction_id" column, and just
insert a new row with the ID just before committing in the engine.

Now, engine commit must be expected to fail from time to time. So it seems
hard to supply the engine with the global transaction ID (as that will not be
generated until engine commit_prepare has succeeded).

But we don't actually need the _global_, linearly ordered ID for this. It
should be sufficient with the local transaction ID generated at the start of
the transaction, as long as the redundancy service records the mapping between
this and the global transaction IDs. And of course if we choose to rollback
later then there will be no matching global transaction ID (which is fine).

So this sounds doable.
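To make the idea concrete, here is a minimal sketch (in Python, purely illustrative — not MariaDB code, and all names are my own invention) of how the redundancy service could record the mapping from engine-local transaction IDs to global transaction IDs, and use it at recovery time to hand an engine exactly the changesets it is missing:

```python
# Illustrative sketch only: the redundancy service records a mapping from
# engine-local transaction IDs to global transaction IDs, so that after a
# crash the engine's recovered position can be translated into a point in
# the RS History and the remaining changesets replayed.

class RedundancyService:
    def __init__(self):
        self.next_global_id = 0
        self.local_to_global = {}   # local txn id -> global txn id
        self.log = []               # (global_id, changeset): the durable log

    def pre_commit(self, local_id, changeset):
        """Assign the global transaction ID and durably log the changeset."""
        global_id = self.next_global_id
        self.next_global_id += 1
        self.local_to_global[local_id] = global_id
        self.log.append((global_id, changeset))
        return global_id

    def changesets_after(self, local_id):
        """On recovery: return everything the engine has not yet applied.

        local_id is the last transaction the engine reports as committed
        (None if the engine has nothing)."""
        if local_id is None:
            return [cs for _, cs in self.log]
        recovered_to = self.local_to_global[local_id]
        return [cs for gid, cs in self.log if gid > recovered_to]
```

A rolled-back transaction simply never gets an entry in the mapping, which matches the point above that a local ID without a matching global ID is fine.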

2. Multi-engine transactions.

So, if a transaction spans multiple engines, during crash recovery each engine
may recover to a different point in time. In this case there will be
transactions that need to be replayed against engine E1 but not against E2.

For row-based events, this seems doable. It should be possible to filter the
events to include only those for tables in the relevant engine.

Now for statement-based events, this is a problem. E.g. a multi-table update
against multiple engines will be hard to apply to just one engine ...

Anyway, I think this is not a showstopper. It seems ok to say that
statement-based replication on multi-engine transactions will not guarantee
reliable crash recovery. Especially since InnoDB currently is the only
transactional engine that supports statement-based replication ...
non-transactional engines are in any case impossible to reliably crash-recover
of course.
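The row-based filtering could look something like this sketch (illustrative only; the table names and table-to-engine map are made up, and a real implementation would get the engine from the table definition):

```python
# Illustrative sketch: during crash recovery each engine may need a
# different tail of the RS History replayed. For row-based events this can
# be done by keeping only the events touching tables stored in that engine.

TABLE_ENGINE = {"orders": "innodb", "stats": "pbxt"}   # hypothetical mapping

def filter_events_for_engine(row_events, engine):
    """Keep only the row events whose table lives in the given engine."""
    return [ev for ev in row_events if TABLE_ENGINE[ev["table"]] == engine]

txn = [
    {"table": "orders", "op": "insert"},
    {"table": "stats",  "op": "update"},
]

# Replay only the part of the transaction belonging to the engine that
# recovered to an earlier point:
assert filter_events_for_engine(txn, "pbxt") == [
    {"table": "stats", "op": "update"},
]
```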

So this is definitely an interesting idea. Quite different from how current
MySQL replication/binlogging works.

> My thinking is that while the engines need to have a guarantee that binlog
> write was successful, binlog (or any other redundancy service) does not
> need the same guarantee from the engines. So instead of full blown 2PC you
> need a simple if()/else clause:
> if (redundancy_service->pre_commit() == SUCCESS)
> {
>    /* commit */
> }
> else
> {
>    /* rollback */
> }

Yes. Until now, I did not consider the possibility that crashed engines could
be recovered from the local log in the redundancy service.

>> Of course, this means that there can only be one redundancy service plugin
>> at a time, doesn't it?
> It sounds like that in the general case. But realistically, there'd be few
> plugins which have to decide on the global transaction ID, others can be
> told. E.g. unlike multi-master replication, binlog plugin can easily log
> events in the order it is told to. So you can effectively stack redundancy
> plugins putting the one which decides on global transaction ID on top. The
> others just follow the defined order. So you can have the following stack:
> multi-master synchronous plugin (for HA)
> ----
> master-slave asynchronous plugin (for fanning out)
> ----
> binlog plugin (for local recovery after crash)

Yes, agree.
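The stacking could be sketched like this (again purely illustrative Python, with invented plugin names): only the plugin at the top of the stack decides the global transaction ID, and the plugins below are simply told the ID and must log in that order.

```python
# Illustrative sketch of stacked redundancy plugins: the top plugin decides
# the global transaction ID; the ones below just follow the decided order.

class TopPlugin:
    """E.g. a multi-master synchronous plugin (for HA)."""
    def __init__(self):
        self.next_id = 0
    def pre_commit(self, changeset):
        gid = self.next_id
        self.next_id += 1
        return gid

class FollowerPlugin:
    """E.g. an async master-slave plugin, or the binlog plugin."""
    def __init__(self, name):
        self.name = name
        self.logged = []
    def log_in_order(self, gid, changeset):
        # A follower never decides ordering; it just records what it is told,
        # and the IDs it sees must be monotonically increasing.
        assert not self.logged or gid > self.logged[-1][0]
        self.logged.append((gid, changeset))

def commit_through_stack(top, followers, changeset):
    gid = top.pre_commit(changeset)
    for f in followers:
        f.log_in_order(gid, changeset)
    return gid
```

This matches the observation above that the binlog plugin can easily log events in whatever order it is told to.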

>> But true, it is essential that storage engine(s) and redundancy service(s)
>> commit in the same order. It is however not yet clear to me that asking
>> redundancy service first is sufficient for this, nor that it is necessary.
> As far as I understand, any (semi)synchronous replication requires that.

Right ... I have to study (semi)synchronous replication techniques a bit more,
I'm not too familiar with them.

> Hm. I'm not well versed in MySQL engine interfaces, so I might be missing
> something here. But why can't we have something like this instead:
>    engine->txn_prepare(thd);
>    if (redundancy_service->pre_commit(thd) == SUCCESS)
>    {
>       engine->txn_commit(thd);
>       redundancy_service->post_commit(thd);
>    }
>    else
>    {
>       engine->txn_rollback(thd);
>    }
> pre_commit() call locks commit_order_lock in the right order.
> post_commit() releases commit_order_lock and the resources that might be
> allocated by transaction in the redundancy service. Or, if txn_commit() is
> too heavy an operation, then maybe like that?:
>    engine->txn_prepare(thd);
>    if (redundancy_service->pre_commit(thd) == SUCCESS)
>    {
>       engine->txn_fix_order(thd);
>       redundancy_service->post_commit(thd);
>       engine->txn_commit(thd);
>    }
>    else
>    {
>       engine->txn_rollback(thd);
>    }

There are (at least) two issues of interest:

1. Can engines commit in a different order from the redundancy service (and if
they can, does it matter)?

2. Can we support group commit (or obtain equivalent performance in some other
way)?
To give durability, the redundancy service will typically need to fflush()
when it commits and generates the global transaction ID. So having group
commit means that while such fflush() for transaction T1 is running, other
transactions T2, T3, ... Tn are free to start (though not complete of course)
committing in the redundancy service. So that they can queue up, and once T1
fflush() is done, a single fflush() will be sufficient to commit all of T2,
... Tn.

So this means that T2 must be able to complete engine->txn_prepare() while T1
is still running inside redundancy_service->pre_commit(). There cannot be a
global lock around all of the above code.

So this in turn means that T3 will be able to complete engine->txn_prepare()
_before_ T2, even though T3 will commit _after_ T2 (at least in terms of
redundancy service).

So the question is whether T3 can complete engine->txn_prepare() before T2,
yet complete txn_commit() after T2. I do not see why not, but I do not know if
this is actually the case; I will need to check.
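The group commit behaviour described above can be sketched as a simple single-threaded simulation (illustrative only; a real implementation would of course involve concurrent transactions and locking, and `flush()` here stands in for the fflush()/fsync() in the redundancy service's log):

```python
# Illustrative sketch of group commit in the redundancy service: while one
# flush is in progress, later transactions queue up, and then a single
# flush makes the whole queued batch durable.

class GroupCommitLog:
    def __init__(self):
        self.pending = []    # written to the log but not yet durable
        self.durable = []    # made durable by a flush
        self.flushes = 0     # number of (expensive) flush calls

    def write(self, txn):
        self.pending.append(txn)

    def flush(self):
        """One flush makes every queued transaction durable at once."""
        self.flushes += 1
        self.durable.extend(self.pending)
        self.pending = []

log = GroupCommitLog()
log.write("T1")
log.flush()                      # T1's flush is running ...
for t in ("T2", "T3", "T4"):     # ... meanwhile T2..T4 queue up
    log.write(t)
log.flush()                      # a single flush commits all of T2..T4

assert log.durable == ["T1", "T2", "T3", "T4"]
assert log.flushes == 2          # 4 transactions, only 2 flushes
```

The point of the simulation is the ratio at the end: n queued transactions cost one flush, not n, which is exactly what engine->txn_prepare() must not be serialized behind a global lock to achieve.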

By the way, is it even necessary to have redundancy_service->post_commit()? It
seems to me that it probably is not needed?

(Though something like pre_commit() and post_commit() will probably be needed
to support the SQL-level 2-phase commit/XA. This is the SQL syntax the user
can use to do two-phase commit between multiple systems used in the
transaction.)
 - Kristian.
