maria-developers team mailing list archive

Implementing new "group commit" API in PBXT?

To: paul.mccullagh@xxxxxxxxxxxxx
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Wed, 29 Sep 2010 11:45:14 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)

Hi Paul!

I want to ask your opinion about implementing in PBXT an extension to the
storage engine API that I am working on.

There are lots of details in
http://askmonty.org/worklog/Server-BackLog/?tid=116 (and even more details in
other places), but I thought you would appreciate the short version :-)

The idea is to get a well-defined ordering of commits in the server in an
efficient way (e.g. without breaking group commit, as InnoDB currently does;
see Bug#13669).

For this, two new (optional) storage engine methods are introduced:

   void (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
   void (*commit_ordered)(handlerton *hton, THD *thd, bool all);

The prepare_ordered() method is called after prepare(), as soon as commit
order is decided. The commit_ordered() method is called before commit(), just
after the transaction coordinator has made the final decision that the
transaction will be durably committed (and not rolled back).

The calls into commit_ordered() among different transactions will happen in
the order that these transactions are committed, consistently across all
engines and the binary log. Same for prepare_ordered().
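
To make the call sequence concrete, here is a minimal single-threaded sketch
of the ordering guarantee described above. ToyEngine and run_commits() are
hypothetical names invented for this illustration; the real methods take
(handlerton *hton, THD *thd, bool all), and the placement of the binlog write
between prepare_ordered() and commit_ordered() is my assumption about where
the "final decision" is taken.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a storage engine. Transactions are plain ints.
struct ToyEngine {
    std::vector<int> prepare_order;  // ids in prepare_ordered() call order
    std::vector<int> commit_order;   // ids in commit_ordered() call order

    void prepare(int) {}             // unordered; may run in parallel
    void prepare_ordered(int trx) { prepare_order.push_back(trx); }
    void commit_ordered(int trx) { commit_order.push_back(trx); }
    void commit(int) {}              // unordered slow part
};

// Toy coordinator. `queue` is the commit order it has decided on.
// Returns the resulting binlog contents (transaction ids in order).
inline std::vector<int> run_commits(const std::vector<int> &queue,
                                    std::vector<ToyEngine> &engines) {
    std::vector<int> binlog;

    // prepare() runs before the order is fixed; any order would do here.
    for (int trx : queue)
        for (ToyEngine &e : engines) e.prepare(trx);

    // Commit order is now decided: prepare_ordered() follows it.
    for (int trx : queue)
        for (ToyEngine &e : engines) e.prepare_ordered(trx);

    // Assumed decision point: the transaction is written to the binlog.
    for (int trx : queue) binlog.push_back(trx);

    // commit_ordered() is called in the same order in every engine.
    for (int trx : queue)
        for (ToyEngine &e : engines) e.commit_ordered(trx);

    // commit() is unordered; reverse order is used to stress the point.
    for (auto it = queue.rbegin(); it != queue.rend(); ++it)
        for (ToyEngine &e : engines) e.commit(*it);

    // Invariant: every engine saw the same commit order as the binlog.
    for (const ToyEngine &e : engines) assert(e.commit_order == binlog);
    return binlog;
}
```

Calling run_commits({3, 1, 2}, engines) with two engines leaves both engines
and the returned binlog with the same sequence 3, 1, 2.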

The idea is that the storage engine should do the minimal amount of work in
commit_ordered() necessary to make the commit visible to other threads, and
to make sure commits appear to be done in the order of calls to these methods.

Do you think either (or both) of these methods could be implemented in PBXT
with reasonable effort (and if so, how)?

----

In InnoDB, this was trivial to do, as the InnoDB commit() method already had a
"fast" part (which fixed the transaction log order (= "commit order") and made
the transaction visible) and a "slow" part (which did the fsync() to make the
transaction durable, and handled group commit).

(It is necessary that commit_ordered() is fast, as it runs under a global
lock. Ideally, it will just allocate an LSN in the transaction log to fix
commit order, and perhaps whatever else already needs to happen serialised
during engine commit).
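
As a rough sketch of the fast/slow split described above (not how InnoDB or
PBXT actually implement it): ToyLogEngine, ordered_step() and
global_commit_lock are hypothetical names, the "fsync" is only simulated by a
counter, and the code is meant to be read single-threaded; a real engine
would need its own synchronisation for the slow part.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical engine state; none of these names come from a real engine.
struct ToyLogEngine {
    std::uint64_t next_lsn = 1;     // next LSN to hand out
    std::uint64_t durable_lsn = 0;  // highest LSN covered by a "sync"
    std::vector<int> visible;       // transaction ids in commit order
    int sync_calls = 0;             // how many simulated fsync()s ran

    // Fast part: only bookkeeping, since it runs under the global lock.
    std::uint64_t commit_ordered(int trx) {
        std::uint64_t lsn = next_lsn++;  // fixes the commit order
        visible.push_back(trx);          // makes the commit visible
        return lsn;
    }

    // Slow part: runs outside the global lock. One simulated sync covers
    // every LSN allocated so far, which is what allows group commit.
    void commit(std::uint64_t lsn) {
        assert(lsn < next_lsn);          // the LSN must have been allocated
        if (lsn <= durable_lsn) return;  // already covered by an earlier sync
        durable_lsn = next_lsn - 1;      // stand-in for fsync() of the log
        ++sync_calls;
    }
};

// Hypothetical stand-in for the global lock held by the server.
std::mutex global_commit_lock;

// What the server would do for one transaction's ordered step.
inline std::uint64_t ordered_step(ToyLogEngine &e, int trx) {
    std::lock_guard<std::mutex> guard(global_commit_lock);
    return e.commit_ordered(trx);
}
```

If two transactions pass through ordered_step() before either calls commit(),
the first commit() makes both durable and the second returns without a second
sync.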

I hope my explanation was clear enough for you to give a qualified answer.
Maybe you can point me to where in the PBXT code a commit becomes visible and
the commit order is fixed?

In case you were wondering, here are some of the motivations for this feature:

1. For taking a hot backup, it is useful to have consistent commit order
   between binlog and storage engines. Without it, it can happen that the
   backed up state of the server has transaction A (but not B) committed in
   the storage engine, and transaction B (but not A) written to the binlog.
   Using such a backup to provision a new master or slave would leave
   replication in an inconsistent state.

2. This feature implements working group commit for the binlog while still
   preserving consistent order as per (1).

3. This will make it possible to implement START TRANSACTION WITH CONSISTENT
   SNAPSHOT for multi-engine transactions in a way that is truly consistent
   (currently it is possible for a transaction to be visible in one engine
   but not another in such a "consistent" snapshot).

4. Galera relies on a consistent commit order, and I believe this feature will
   allow it to get this in a more engine-independent way.

5. We are planning to use consistent commit order to allow MySQL, after a
   crash, to recover transactions that were synced to disk in the binlog but
   not in the engine. This will make it possible to reduce the number of
   fsync() calls during prepare() / commit() from 3 to 1: the fsync() only
   needs to be done in the binlog (with group commit); the engine does not
   need to fsync(), as any lost transactions will be recovered from the
   binlog after a crash.

6. The prepare_ordered() method is inspired by the Facebook patch to release
   InnoDB row-level read locks early (before syncing the binlog to disk) to
   improve performance in the presence of hot spots (probably does not apply
   to PBXT).
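
To illustrate point (5), here is a minimal sketch of how recovery could pick
the transactions to replay. It assumes (my assumption, following from the
consistent commit order) that the commits which survived in the engine form a
prefix of the binlog order, so only the tail of the binlog needs re-applying.
transactions_to_recover() and its parameters are hypothetical names, not part
of any existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: `binlog` holds transaction ids in commit order, and
// `engine_durable_count` is how many of them survived in the engine.
// Returns the ids that must be re-applied from the binlog after a crash.
inline std::vector<int> transactions_to_recover(
        const std::vector<int> &binlog, std::size_t engine_durable_count) {
    // The engine cannot have more durable commits than the binlog holds,
    // because the binlog is synced first.
    assert(engine_durable_count <= binlog.size());
    auto first = binlog.begin() +
                 static_cast<std::ptrdiff_t>(engine_durable_count);
    return std::vector<int>(first, binlog.end());
}
```

With a binlog holding transactions 1, 2, 3, 4 and an engine that only made
the first two durable, the helper returns 3 and 4. Without a consistent
commit order the missing transactions would not necessarily be a tail of the
binlog, and this simple scheme would not work.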

 - Kristian.
Follow ups

Re: Implementing new "group commit" API in PBXT?
From: Paul McCullagh, 2010-10-04