maria-developers team mailing list archive

Why do we need fsync() in commit() in internal two-phase commit?

To: Sergei Golubchik <serg@xxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Tue, 26 Oct 2010 13:44:09 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)

Currently, when an InnoDB/XtraDB transaction is committed with the binlog
enabled, we do three fsync()s:

1. Inside prepare() in InnoDB

2. When writing to the binlog

3. Inside commit() in InnoDB

The fsync()s are done when --innodb-flush-log-at-trx-commit=1 and
--sync-binlog=1 are set; these settings are needed to be able to recover to a
state where the binlog and InnoDB are consistent after a crash during commit.

This got me wondering whether all three are really needed.

 - I understand why we need the fsync() in prepare(): otherwise, after a
   crash, we might have a transaction in the binlog that is missing in InnoDB
   and that we cannot (currently) recover.

 - I understand why we need the fsync() in the binlog write: otherwise the
   commit in InnoDB may reach the disk before the binlog write, and after a
   crash we might have a transaction in InnoDB that is missing in the binlog
   and cannot be recovered.

But why do we need the fsync() in commit()?

We do not need it to ensure durability or consistency. If we crash after
commit() returns (or even just after the binlog write finishes), but before
the InnoDB commit reaches disk, the crash recovery at the next server start
will re-commit the transaction inside InnoDB.
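
The recovery rule relied on here can be sketched as a toy model. This is an
illustration only, not MariaDB code: the function `recover` and its arguments
are hypothetical names, and the roll-back branch for transactions missing
from the binlog is an assumption about how the remaining prepared
transactions are handled.

```python
def recover(prepared_in_engine, xids_in_binlog):
    """Toy model of crash recovery for internal two-phase commit.

    prepared_in_engine: xids the engine finds in the prepared state.
    xids_in_binlog: xids found when scanning the binlog.
    Returns (committed, rolled_back) as two sets of xids.
    """
    committed, rolled_back = set(), set()
    for xid in prepared_in_engine:
        if xid in xids_in_binlog:
            # The binlog write was durable, so the commit is re-done.
            committed.add(xid)
        else:
            # Assumption: a prepared transaction that never reached the
            # binlog is rolled back.
            rolled_back.add(xid)
    return committed, rolled_back


# Crash after the binlog fsync() but before the engine commit reached disk:
# xid 7 is still "prepared" in the engine, yet it is present in the binlog.
committed, rolled_back = recover({7, 8}, {5, 6, 7})
```

In this model the fsync() in commit() plays no part in deciding the outcome:
only the prepared state and the binlog contents are consulted.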

In fact, it seems to me the only reason for the third fsync() is that we call
TC_LOG_BINLOG::unlog() after InnoDB commit() returns. And unlog() may decide
to rotate the binlog once it has been called for all transactions written to
the current log file. And during recovery, we only read the latest binlog, so
transactions in older binlogs must have reached disk for recovery to work.
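
The hazard described above can be made concrete with a small toy model. It
is only a sketch under the assumptions stated in this mail (recovery scans
just the latest binlog file); `recover_latest_only` and the data layout are
hypothetical names for illustration, not server code.

```python
def recover_latest_only(binlogs, prepared):
    """Toy recovery that scans only the newest binlog file.

    binlogs: list of sets of xids, oldest file first.
    prepared: xids the engine finds in the prepared state after the crash.
    Returns (committed, rolled_back) as two sets of xids.
    """
    latest = binlogs[-1]
    committed = {xid for xid in prepared if xid in latest}
    rolled_back = set(prepared) - committed
    return committed, rolled_back


# xid 7 was written and fsync()ed to the first binlog file, after which the
# binlog was rotated. The engine commit of 7 never reached disk (no third
# fsync), so after the crash the engine still sees 7 as prepared.
binlogs = [{5, 6, 7}, {9}]
committed, rolled_back = recover_latest_only(binlogs, prepared={7, 9})
```

Here xid 7 ends up rolled back in the engine although it is present in the
older binlog file, which is exactly the inconsistency that the fsync() in
commit() prevents today.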

Do you agree that this is the only reason the third fsync() is needed?

If so, it seems it would not be too hard to avoid that fsync(). E.g., we
could recover from the last two binlog files instead of only one. We would
need a mechanism for InnoDB to tell the binlog, asynchronously (after
returning from commit()), that transaction `Xid' has reached the disk.
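
One possible shape for such a mechanism, as a hedged sketch rather than a
design: the binlog side tracks, per binlog file, which xids the engine has
not yet reported as durable, and recovery scans every file that still has
such xids. The class `BinlogTracker` and all its method names are
hypothetical.

```python
class BinlogTracker:
    """Toy bookkeeping of xids not yet durable in the engine, per file."""

    def __init__(self):
        # binlog file name -> xids whose engine commit is not yet on disk;
        # dict insertion order keeps the files oldest first.
        self.pending = {}

    def log(self, filename, xid):
        """Record that xid was written to the given binlog file."""
        self.pending.setdefault(filename, set()).add(xid)

    def engine_durable(self, xid):
        """Asynchronous notification: the engine commit of xid is on disk."""
        for xids in self.pending.values():
            xids.discard(xid)

    def files_needed_for_recovery(self, current):
        """Files recovery must scan: the current one plus any older file
        that still has xids the engine has not confirmed."""
        return [name for name, xids in self.pending.items()
                if xids or name == current]


tracker = BinlogTracker()
tracker.log("binlog.000001", 7)
tracker.log("binlog.000002", 9)   # written after a rotation
before = tracker.files_needed_for_recovery("binlog.000002")
tracker.engine_durable(7)         # arrives some time after commit() returned
after = tracker.files_needed_for_recovery("binlog.000002")
```

To keep recovery bounded to the last two files, as suggested above, a
further rotation would presumably have to wait until the older file has no
pending xids left; the sketch does not enforce that.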

[I just wanted to confirm (or refute) this reasoning. As we have been
talking about a way to avoid both the fsync() in prepare() /and/ the fsync()
in commit(), that may be a better project to implement than just avoiding the
one in commit().]

 - Kristian.

Follow ups

Re: Why do we need fsync() in commit() in internal two-phase commit?
From: Sergei Golubchik, 2010-10-26