maria-developers team mailing list archive

Re: Implementing new "group commit" API in PBXT?

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Paul McCullagh <paul.mccullagh@xxxxxxxxxxxxx>
Date: Mon, 4 Oct 2010 10:08:54 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87hbh8lw7p.fsf@knielsen-hq.org>

Hi Kristian,

The easiest way to do this would be to add a parameter to xn_end_xact() that indicates that the log should not be written or flushed.

In xn_end_xact(), the last parameter to the call to xt_xlog_log_data() determines what should happen:


#define XT_XLOG_NO_WRITE_NO_FLUSH	0
#define XT_XLOG_WRITE_AND_FLUSH		1
#define XT_XLOG_WRITE_AND_NO_FLUSH	2

Without write or flush, this is a very fast operation. But the transaction is still committed and ordered; it is just not durable.
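To illustrate how the three modes differ, here is a minimal sketch of a toy log that tracks only offsets and does no real I/O. ToyLog and toy_log_data() are hypothetical names made up for this example, not PBXT code; only the three XT_XLOG_* values are taken from the message above.

```c
#include <assert.h>

#define XT_XLOG_NO_WRITE_NO_FLUSH   0
#define XT_XLOG_WRITE_AND_FLUSH     1
#define XT_XLOG_WRITE_AND_NO_FLUSH  2

/* Hypothetical model of a transaction log: offsets only, no real I/O. */
typedef struct {
    long appended;  /* bytes placed in the in-memory log buffer */
    long written;   /* bytes handed to the operating system */
    long flushed;   /* bytes known to be durable on disk */
} ToyLog;

/* Append a record of `size` bytes and return the offset where it starts.
 * The flush mode decides how far the written/flushed marks advance. */
static long toy_log_data(ToyLog *log, long size, int flush_mode)
{
    long start = log->appended;

    assert(size >= 0);
    log->appended += size;  /* the record is now ordered in the log */

    if (flush_mode != XT_XLOG_NO_WRITE_NO_FLUSH)
        log->written = log->appended;   /* write, but maybe no flush */
    if (flush_mode == XT_XLOG_WRITE_AND_FLUSH)
        log->flushed = log->written;    /* write and make durable */

    return start;
}
```

With XT_XLOG_NO_WRITE_NO_FLUSH only the append offset moves, which is why the record has a fixed position in the log without yet being durable.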

Then, we have to make a note on the thread to flush the log when the actual commit is called.

But this need not be a general flush. The thread only needs to flush the log past the point at which the commit record was written.

The position is returned by xlog_append(), which was called by xt_xlog_log_data() above. At the moment, this return value is ignored.

In the case of commit_ordered, this value must be stored. We then need to add the size of the COMMIT record to the offset.

Then, when the actual commit is called, we check the current log flush position against the flush position we need. If it is already past our position, then this is a NOP.

If not, then we need to call xlog_append() with no data. This will do a group commit on the log.
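The steps above can be sketched roughly as follows. This is only a toy model under stated assumptions: ToyLog, ToyThread, toy_append(), toy_commit_ordered(), toy_commit() and TOY_COMMIT_REC_SIZE are hypothetical names, the log is a pair of offsets rather than a file, and toy_append() merely stands in for xlog_append() by returning the start offset of the record it appended.

```c
#include <assert.h>

#define TOY_COMMIT_REC_SIZE 16  /* assumed size of a COMMIT record */

/* Hypothetical log model: offsets only, no real I/O. */
typedef struct {
    long end_pos;      /* offset just past the last appended byte */
    long flush_pos;    /* everything below this offset is durable */
    int  flush_calls;  /* number of real flushes, for illustration */
} ToyLog;

/* Per-thread note of how far the log must be flushed at commit time. */
typedef struct {
    long must_flush_to;
} ToyThread;

/* Stand-in for xlog_append(): returns the start offset of the record.
 * Calling it with size 0 and flush set only flushes what is pending. */
static long toy_append(ToyLog *log, long size, int flush)
{
    long start = log->end_pos;

    assert(size >= 0);
    log->end_pos += size;
    if (flush && log->flush_pos < log->end_pos) {
        log->flush_pos = log->end_pos;  /* one flush covers all waiters */
        log->flush_calls++;
    }
    return start;
}

/* Fast part: append the COMMIT record without write or flush, and
 * remember the offset just past it. */
static void toy_commit_ordered(ToyLog *log, ToyThread *thr)
{
    long start = toy_append(log, TOY_COMMIT_REC_SIZE, 0);

    thr->must_flush_to = start + TOY_COMMIT_REC_SIZE;
}

/* Slow part: flush only if the log is not yet durable up to our record. */
static void toy_commit(ToyLog *log, ToyThread *thr)
{
    if (log->flush_pos < thr->must_flush_to)
        toy_append(log, 0, 1);  /* no data: acts as a group commit */
    thr->must_flush_to = 0;
}
```

If two threads call toy_commit_ordered() and the later one then calls toy_commit(), its flush also covers the earlier thread's record, so the earlier thread's toy_commit() finds the flush position already past its own and does nothing.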

It was a bit difficult to explain, so please ask if anything is not clear.


Best regards,

Paul

On Sep 29, 2010, at 11:45 AM, Kristian Nielsen wrote:

I want to ask your opinion about implementing in PBXT an extension to the
storage engine API that I am working on.

There are lots of details in
http://askmonty.org/worklog/Server-BackLog/?tid=116 (and even more details in
other places), but I thought you would appreciate the short version :-)

The idea is to get a well-defined ordering of commits in the server in an
efficient way (eg. not break group commit like InnoDB does currently,
Bug#13669).

For this, two new (optional) storage engine methods are introduced:

  void (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
  void (*commit_ordered)(handlerton *hton, THD *thd, bool all);

The prepare_ordered() method is called after prepare(), as soon as commit
order is decided. The commit_ordered() method is called before commit(), just
after the transaction coordinator has made the final decision that the
transaction will be durably committed (and not rolled back).

The calls into commit_ordered() among different transactions will happen in
the order that these transactions are committed, consistently across all
engines and the binary log. Same for prepare_ordered().

The idea is that the storage engine should do the minimal amount of work in
commit_ordered() necessary to make the commit visible to other threads. And to
make sure commits appear to be done in the order of calls to these methods.
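
As a rough illustration of the calling sequence described above, the following toy model shows a coordinator calling the optional commit_ordered() hook on each engine first and the regular commit() afterwards. Everything here is hypothetical: toy_handlerton is a cut-down stand-in for the real handlerton, THD is reduced to an id, and toy_commit_transaction() is an invented helper. The global lock the server would hold around the commit_ordered() calls is indicated only by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct THD { int id; } THD;  /* toy stand-in for the server THD */

typedef struct toy_handlerton toy_handlerton;
struct toy_handlerton {
    /* Optional: may be NULL if the engine does not implement it. */
    void (*commit_ordered)(toy_handlerton *hton, THD *thd, bool all);
    int  (*commit)(toy_handlerton *hton, THD *thd, bool all);
    int visible[8];  /* transaction ids in the order they became visible */
    int n_visible;
    int n_durable;   /* number of transactions made durable */
};

/* Fast part: only fix the commit order and make the commit visible. */
static void toy_commit_ordered(toy_handlerton *hton, THD *thd, bool all)
{
    (void)all;
    assert(hton->n_visible < 8);
    hton->visible[hton->n_visible++] = thd->id;
}

/* Slow part: this is where a real engine would sync its log. */
static int toy_commit(toy_handlerton *hton, THD *thd, bool all)
{
    (void)thd;
    (void)all;
    hton->n_durable++;
    return 0;
}

/* Hypothetical coordinator step for one transaction. */
static int toy_commit_transaction(toy_handlerton **engines, size_t n, THD *thd)
{
    size_t i;
    int err = 0;

    /* A real server would hold its global commit-order lock here. */
    for (i = 0; i < n; i++)
        if (engines[i]->commit_ordered != NULL)
            engines[i]->commit_ordered(engines[i], thd, true);
    /* Lock released; the slow, durable part can run in parallel. */

    for (i = 0; i < n; i++)
        if (engines[i]->commit(engines[i], thd, true) != 0)
            err = 1;
    return err;
}
```

An engine that leaves commit_ordered as NULL is simply skipped in the first loop and still gets its normal commit() call.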

Do you think either (or both) of these methods could be implemented in PBXT
with reasonable effort (and if so, how)?

----
In InnoDB, this was trivial to do, as the InnoDB commit() method already had a
"fast" part (which fixed the transaction log order (= "commit order") and made
the transaction visible) and a "slow" part (which did the fsync() to make the
transaction durable, and handled group commit).

(It is necessary that commit_ordered() is fast, as it runs under a global
lock. Ideally, it will just allocate an LSN in the transaction log to fix
commit order, and perhaps whatever else already needs to happen serialised
during engine commit).

I hope my explanation was sufficiently clear for you to make a qualified
answer. Maybe you can point me to where in the PBXT code a commit becomes
visible and the commit order is fixed?

In case you were wondering, here are some of the motivations for this feature:

1. For taking a hot backup, it is useful to have consistent commit order
   between binlog and storage engines. Without it, it can happen that the
   backed up state of the server has transaction A (but not B) committed in
   the storage engine, and transaction B (but not A) written to the binlog.
   Using such backup to provision a new master or slave would leave
   replication in an inconsistent state.

2. This feature implements working group commit for the binlog while still
   preserving consistent order as per (1).

3. This will allow to implement START TRANSACTION WITH CONSISTENT SNAPSHOT for
   multi-engine transactions that is truly consistent (currently it is
   possible for a transaction to be visible in one engine but not another in
   such "consistent" snapshot).

4. Galera relies on a consistent commit order, and I believe this feature will
   allow it to get this in a more engine-independent way.

5. We are planning to use consistent commit order to allow MySQL to recover
   after a crash transactions that were synced to disk in the binlog but not
   in the engine. This will allow to reduce the number of fsyncs() during
   prepare() / commit() from 3 to 1; it only needs to be done in the binlog
   (with group commit); the engine does not need to fsync(), as any lost
   transactions will be recovered from the binlog after crash.

6. The prepare_ordered() method is inspired by the Facebook patch to release
   InnoDB row-level read locks early (before syncing the binlog to disk) to
   improve performance in the presence of hot spots (probably does not apply
   to PBXT).

- Kristian.




--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com

Follow ups

Re: Implementing new "group commit" API in PBXT?
From: Kristian Nielsen, 2010-10-05
Re: Implementing new "group commit" API in PBXT?
From: Kristian Nielsen, 2010-10-04

References

Implementing new "group commit" API in PBXT?
From: Kristian Nielsen, 2010-09-29