maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication

To: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 29 Mar 2010 00:02:09 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <703469413d2b690a507d6418a96ece27@localhost> (Alex Yurchenko's message of "Fri, 19 Mar 2010 04:50:05 +0200")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)

Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx> writes:

> On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen
> <knielsen@xxxxxxxxxxxxxxx> wrote:

> Hm, how is it different from how it is done currently in MariaDB? Does
> txn_commit() have to follow the same order as txn_prepare()? If not, then
> the commit ordering imposed by redundancy service should not be a problem.

Ok, I checked, and indeed there is no requirement that prepare is done in the
same order as commit.

In fact, there seems to be no requirement on ordering of commit among
different engines and binlog at all in the server itself!

(Since the XA/2pc in MySQL assumes every engine ensures durability by itself,
there is no requirement for any ordering. In case of a crash, each engine
will be able to recover every transaction successfully prepared, so it is just
a matter of deciding which of them to commit and which to rollback.)

So I agree, there is no problem with the redundancy service imposing some order,
with the purpose of enabling recovery even without durability guarantee by
each individual engine.

----

Now, InnoDB _does_ have a requirement to commit in the same order as the
binlog (due to InnoBackup; if the commit order is not the same, the snapshot
made by the backup may not correspond to any position in the binlog, which
breaks restore).

The way this is implemented in InnoDB is by taking a global mutex in InnoDB
prepare(), which is only released in InnoDB commit().

This is a really bad way to do things :-(. It means that only one (InnoDB)
transaction at a time can be running the code between prepare() and commit().
Since this is where the binlog is written (and the point where the redundancy
service makes the transaction durable in our discussions), this makes it
impossible to do group commit!

Again, I think a good solution to this is to have an (optional) storage engine
callback fix_commit_order(). This will be called after successful prepare(),
but before commit(). It should do the minimum amount of work necessary to make
sure that transactions are committed in the order that fix_commit_order() is
called. The upper layer (/redundancy service) will call fix_commit_order() for
all transaction participants under a global mutex, ensuring correct order.

    lock(global_commit_order_mutex)
    fix_binlog_or_redundancy_service_commit_order()
    for (each storage engine)
        engine->fix_commit_order()
    unlock(global_commit_order_mutex)

(If the same commit order is not needed, fix_commit_order() can be NULL, and
if all fix_commit_order() callbacks are NULL there is no need to take the
mutex).
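
To make the pseudocode above a bit more concrete, here is a rough C++ sketch of
the upper-layer step, including the skip-the-mutex case. This is only an
illustration, not actual server code: Txn, Engine, record_order(),
fix_binlog_commit_order() and fix_commit_order_all() are all made-up names, and
the binlog is reduced to a vector of transaction ids.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical transaction handle; the sketch only needs an id.
struct Txn { unsigned long id; };

// Hypothetical engine descriptor. fix_commit_order is optional and is
// left as nullptr by engines that do not care about commit order.
struct Engine {
    const char *name;
    void (*fix_commit_order)(Engine *self, Txn *txn);
    std::vector<unsigned long> order;  // order this engine was told to commit in
};

static std::mutex global_commit_order_mutex;
static std::vector<unsigned long> binlog_order;  // stand-in for the binlog

// Example callback: just remember the order the calls arrive in.
static void record_order(Engine *self, Txn *txn) {
    self->order.push_back(txn->id);
}

// Stand-in for fix_binlog_or_redundancy_service_commit_order().
static void fix_binlog_commit_order(Txn *txn) {
    binlog_order.push_back(txn->id);
}

// Called after prepare() succeeded everywhere, but before any commit().
void fix_commit_order_all(Txn *txn, const std::vector<Engine *> &engines) {
    assert(txn != nullptr);

    bool ordering_needed = false;
    for (Engine *e : engines)
        if (e->fix_commit_order != nullptr)
            ordering_needed = true;
    if (!ordering_needed)
        return;  // all callbacks are NULL: no need to take the mutex

    // Holding the mutex across all calls means every participant sees
    // the transactions in the same order.
    std::lock_guard<std::mutex> guard(global_commit_order_mutex);
    fix_binlog_commit_order(txn);
    for (Engine *e : engines)
        if (e->fix_commit_order != nullptr)
            e->fix_commit_order(e, txn);
}
```

The work done under the mutex is meant to be tiny (recording a position), so
the critical section stays short even with many concurrent transactions.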

Then InnoDB does not need to hold a global mutex across prepare() / commit().
In fact all it needs to do in fix_commit_order() is to put the transaction
into a sorted list of pending commits. Then each transaction in commit() only
needs to wait until it is first in this list, which is _much_ better than hanging
in prepare() waiting for _all_ transactions to commit!
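
As a rough illustration of the pending-commits idea, here is a small C++ sketch.
It is not InnoDB code: PendingCommits, enqueue(), wait_for_turn(), finish() and
demo_commit_order() are made-up names. It also assumes that enqueue() is only
called under the global commit-order mutex, so a plain FIFO queue is already
"sorted" and no explicit sort key is needed.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

class PendingCommits {
public:
    // Would be called from fix_commit_order(); arrival order is the
    // required commit order.
    void enqueue(unsigned long txn_id) {
        std::lock_guard<std::mutex> guard(mu_);
        pending_.push_back(txn_id);
    }

    // Would be called at the start of commit(): block until this
    // transaction is first in the list.
    void wait_for_turn(unsigned long txn_id) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] {
            return !pending_.empty() && pending_.front() == txn_id;
        });
    }

    // Would be called at the end of commit(): let the next one proceed.
    void finish(unsigned long txn_id) {
        {
            std::lock_guard<std::mutex> guard(mu_);
            assert(!pending_.empty() && pending_.front() == txn_id);
            pending_.pop_front();
        }
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<unsigned long> pending_;
};

// Small driver: fix the order as 1, 2, 3, then start the committing
// threads in reverse order and return the order they committed in.
std::vector<unsigned long> demo_commit_order() {
    PendingCommits queue;
    std::vector<unsigned long> committed;  // only touched by the thread whose turn it is
    for (unsigned long id = 1; id <= 3; ++id)
        queue.enqueue(id);

    std::vector<std::thread> threads;
    for (unsigned long id = 3; id >= 1; --id) {
        threads.emplace_back([&queue, &committed, id] {
            queue.wait_for_turn(id);
            committed.push_back(id);  // stand-in for the real engine commit
            queue.finish(id);
        });
    }
    for (std::thread &t : threads)
        t.join();
    return committed;
}
```

Note that nothing is held between prepare() and commit() here; a transaction
only waits for the ones queued ahead of it, and everything before that point
(binlog write included) can run concurrently.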

(Other implementations are possible as well, of course).

 - Kristian.
