maria-developers team mailing list archive

Re: Why do we need fsync() in commit() in internal two-phase commit?


Hi, Kristian!

On Oct 26, Kristian Nielsen wrote:
> Currently, when an InnoDB/XtraDB transaction is committed with the
> binlog enabled, we do three fsync()'s:
>
> 1. Inside prepare() in InnoDB
> 2. When writing to the binlog
> 3. Inside commit() in InnoDB
>
> Why do we need the fsync() in commit()?
>
> We do not need it to ensure durability or consistency. If we crash
> after commit() returns (or just after the binlog write finishes), but
> before the InnoDB commit reaches disk, the crash recovery at next
> server start will re-commit the transaction inside InnoDB.
>
> In fact, it seems to me the only reason for the third fsync() is that
> we call TC_LOG_BINLOG::unlog() after InnoDB commit() returns. And
> unlog() may decide to rotate the binlog once it has been called for
> all transactions written to the current log file. And during recovery,
> we only read the latest binlog, so transactions in older binlogs must
> have reached disk for recovery to work.
>
> Do you agree that this is the only reason the third fsync() is needed?

Yes, sounds logical.

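
To make the scenario above concrete, here is a minimal Python sketch. It is a toy model, not MariaDB code: the names Engine, recover and run are hypothetical, the binlog is assumed to be durable as soon as the write returns, and recovery is assumed to scan only the newest binlog file, as described in the quoted text.

```python
class Engine:
    """Toy storage engine: 'disk' survives a crash, 'memory' does not."""

    def __init__(self):
        self.disk = {}    # xid -> state that has reached disk
        self.memory = {}  # xid -> latest state, possibly not yet synced

    def fsync(self):
        self.disk.update(self.memory)

    def prepare(self, xid):
        self.memory[xid] = "prepared"
        self.fsync()                      # fsync #1

    def commit(self, xid, sync):
        self.memory[xid] = "committed"
        if sync:
            self.fsync()                  # fsync #3, the one in question

    def crash(self):
        self.memory = dict(self.disk)     # anything unsynced is lost


def recover(engine, binlog_files):
    """Resolve prepared transactions using only the newest binlog file."""
    logged = set(binlog_files[-1])
    for xid, state in list(engine.disk.items()):
        if state == "prepared":
            engine.disk[xid] = "committed" if xid in logged else "rolled back"
    engine.memory = dict(engine.disk)


def run(sync_commit, rotate_before_crash):
    engine, binlog_files = Engine(), [[]]
    engine.prepare(1)
    binlog_files[-1].append(1)            # binlog write, fsync #2
    engine.commit(1, sync=sync_commit)
    if rotate_before_crash:
        binlog_files.append([])           # unlog() decided to rotate
    engine.crash()
    recover(engine, binlog_files)
    return engine.disk[1]
```

In this model, run(False, False) ends with the transaction "committed" (recovery re-commits it from the binlog), while run(False, True) ends with it "rolled back" even though it is in the older binlog file. That is the rotation case the third fsync() protects against.
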
> If so, it seems it would not be too hard to avoid that fsync(). E.g., we could
> recover from the last two binlog files instead of only one. We would need a
> mechanism for InnoDB to tell the binlog that transaction `Xid' reached the
> disk, in an asynchronous way (after returning from commit()).

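The bookkeeping that such an asynchronous notification implies could be sketched as below. This is only an illustration under assumptions: BinlogTracker and its methods are hypothetical names, not an existing MariaDB interface, and the model simply records, per binlog file, which Xids the engine has not yet reported as durable.

```python
class BinlogTracker:
    """Hypothetical per-file record of xids not yet durable in the engine."""

    def __init__(self):
        self.files = [set()]              # one set of pending xids per binlog file

    def log(self, xid):
        self.files[-1].add(xid)           # transaction written to the current file

    def rotate(self):
        self.files.append(set())          # start a new binlog file

    def engine_durable(self, xid):
        # Asynchronous callback: the engine reports that xid reached disk.
        for pending in self.files:
            pending.discard(xid)

    def files_needed_for_recovery(self):
        """Indexes of the binlog files a recovery scan would have to cover."""
        newest = len(self.files) - 1
        oldest = next((i for i, p in enumerate(self.files) if p), newest)
        return list(range(oldest, newest + 1))
```

For example, after log(1), rotate(), log(2), rotate(), log(3) with no notifications received, files_needed_for_recovery() covers all three files. The number of files to scan depends on how far the notifications lag behind rotation.
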
Reading two, three, or any number of binlogs is not a solution - it only
increases the chance that recovery will work, but does not guarantee
it. For a correct solution we'll need a way to call unlog()