← Back to team overview

maria-developers team mailing list archive

Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)


Hi Kristian,

I am happy to see you being active on the maria-developers list. I
don’t think we have collaborated in the past. I have been working on
the InnoDB storage engine under MySQL 4.0 through 5.7 (and some 8.0)
since 2003, and since 2016 under MariaDB Server, mostly 10.1 and
later. I understand that you were an early contributor to MariaDB and
implemented some durability I/O optimization between the "distributed
transaction" of binlog and the InnoDB write-ahead log (redo log,

I think that some improvement in this area is long overdue. I find it
hard to read the high-level descriptions. Nowhere in
https://jira.mariadb.org/browse/MDEV-232 or
https://jira.mariadb.org/browse/MDEV-532 I see any mention of any key

* InnoDB log sequence number (LSN): a logical point of time for any
persistent InnoDB changes. Measured in bytes. Mini-transactions
(atomic changes of InnoDB pages) start and end at some LSN.
Corresponds to the Oracle system change number (SCN).
* Binlog offset: a logical point of time for any changes at the
replication level. Measured in bytes. Replication events such as
statements or row-level events start and end at some binlog offset.
* The LSN or binlog offset of a COMMIT, XA PREPARE, XA ROLLBACK, or XA COMMIT.

There is quite a bit of code in InnoDB that I would like to get rid
of. Some of it is related to the data members trx_t::flush_log_later
and trx_t::must_flush_log_later. There is also the confusingly named
handlerton::commit_checkpoint_request that has nothing to do with log
checkpoints, but with some kind of a durability request that fails to
identify the minimum logical time up to which everything is desired to
be durable. (My talk
https://fosdem.org/2022/schedule/event/mariadb_innodb/ explains what a
log checkpoint is and how recovery works.)

Some years ago, in MDEV-21534 Vladislav Vaintroub introduced
write_lock and flush_lock in InnoDB. This works pretty well: a
Sysbench workload with innodb_flush_log_at_trx_commit=1 will linearly
improve throughput as you add more client connection threads. I think
that something similar would be beneficial for the binlog, to keep
track of the latest written offset and the latest write offset after
which fdatasync() has completed. I am not familiar with the binlog
code. A colleague recently claimed to me that LOCK_commit_ordered is a
huge bottleneck. As you can see, the write_lock and flush_lock are
better because the lock requests include a notion of logical time: If
everything up to "my" time has been written or durably written, I do
not care to wait. In MDEV-27774 (MariaDB Server 10.8) this was
improved further: multiple mini-transactions can copy their local log
buffer to the shared log_sys.buf concurrently.

It seems to me that the missing tracking of logical time in the data
structures is causing hangs like MDEV-25611. I feel that the design is
based on a wrong assumption that the binlog-covered transactions are
the only source of InnoDB log writes. That has never been the case:
For example, there have always been background threads or tasks that
purge the history of committed transactions.

The task MDEV-18959 (ensuring that one fdatasync() per durable
transaction suffices) is also related to this. I think that the
simplest solution (assuming that moving the binlog to a bunch of
transactional tables is out of the question) is to ensure that one of
the logs (binlog and other write-ahead logs) is always ahead. If we
choose that the binlog must be ahead, then in InnoDB the write_lock
acquisition would have to be preceded by a binlog_write_lock
acquisition that will ensure that any relevant changes of persistent
state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part
of the InnoDB log write must have been durably written to the binlog.

With best regards,

Marko Mäkelä, Lead Developer InnoDB
MariaDB plc

Follow ups