← Back to team overview

maria-developers team mailing list archive

Re: Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)


Marko Mäkelä <marko.makela@xxxxxxxxxxx> writes:

> later. I understand that you were an early contributor to MariaDB and
> implemented some durability I/O optimization between the "distributed
> transaction" of binlog and the InnoDB write-ahead log (redo log,
> ib_logfile0).

Indeed, that was the group commit work:


> I think that some improvement in this area is long overdue. I find it

Agree. I think it would be fantastic to combine your extensive knowledge of
InnoDB with my binlog/replication expertise to improve this! There's a lot
of very interesting points in your mail:

> hard to read the high-level descriptions. Nowhere in
> https://jira.mariadb.org/browse/MDEV-232 or
> https://jira.mariadb.org/browse/MDEV-532 I see any mention of any key
> terminology:

Indeed, my understanding of InnoDB internals is far from complete (though I
do understand the concept of LSN at a high level of course).

>From the binlog side, there are some basic design goals that influence how
the binlog group commit is implemented.

1. The APIs (handlerton methods and server calls) are designed to be
storage-engine independent. So this is one reason why we would not see an
InnoDB LSN anywhere in the server part.

2. Transactions (the replication term is "event group") are always written
into the binlog as a single atomic (kind of) block. So no interleaving of
transactions in the binlog, or transactions that span multiple binlog files.

3. Transactions in the binlogs are strictly ordered, even across different
servers in the same replication topology (this is used for GTID and in-order
optimistic parallel replication).

4. Identical commit order is enforced between the binlog and the storage
engine. I think this is related to getting consistent replication slave
position from an innobackup or LVM/BTRFS snapshot backup.

> and trx_t::must_flush_log_later. There is also the confusingly named
> handlerton::commit_checkpoint_request that has nothing to do with log
> checkpoints, but with some kind of a durability request that

This was an optimisation to remove one fsync from commits when binlog is


If the server crashes, then on restart the binlog transaction coordinator
needs to recover the XA state between the binlog and the engines. For this
it scans the storage engines and the binlog files for any pending
transactions. Any binlog file that contains pending XA transactions
needs to be scanned. Once all transactions in a binlog file are durable
committed in the engines, it is preferred to not scan it for efficiency

commit_checkpoint_request() is called when the binlog wants to mark a point
before which binlog files no longer need to be scanned (a "binlog
checkpoint"). We can see in innobase_checkpoint_request() that this is
mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above,
where the server-level API (binlog) is kept separate from storage engine
internals (InnoDB LSN).

> fails to
> identify the minimum logical time up to which everything is desired to
> be durable. (My talk

I wonder if this is a real mismatch between the API and InnoDB, or a
misunderstanding about what is intended from commit_checkpoint_request().
The binlog checkpoint mechanism is completely asynchronous, and it is not a
request for the storage engine to take any action to make things durable. It
is merely a request for InnoDB to conservatively pick an LSN that is known
to contain any transaction that has completed commit_ordered(), it may be
any later LSN that is cheaper to obtain. Then later, when that LSN is
eventually made durable on the engine's own initiative, the engine will call
commit_checkpoint_notify_ha() to complete the binlog checkpoint.

The intention is that this will be very easy for an engine to implement,
without any changes to how it manages internal logs and durability. It is
just a way for the engine to periodically communicate that some LSN is now
durable, in a way that can match the LSN to a binlog position without
assuming a particular storage engine or LSN format. But if this intention
fails, we should look into it.

> Some years ago, in MDEV-21534 Vladislav Vaintroub introduced
> write_lock and flush_lock in InnoDB. This works pretty well: a

> If
> everything up to "my" time has been written or durably written, I do
> not care to wait.

So if I understand correctly, records from parallel transactions are written
concurrently into the InnoDB log in an interleaved fashion. So one thread
can just put log records into a buffer as they are generated and continue
query execution, and later they may be written or synced asynchroneously by
another thread.

This is not how the binlog works because of points (2) and (3) above. The
full set of log records ("binlog events") is only available during the
commit step, and it needs to be written into the binlog at a specific point
relative to all other transactions. So it sounds to me that a similar
approach with write_lock and flush_lock cannot be used for the binlog (but
correct me if I'm wrong).

Instead the binlog uses a different optimisation:


Since the binlog writing is essentially sequential in nature, it is more
efficient to do it in a single thread for all waiting threads, and this is
how the MariaDB binlog group commit is done.

> code. A colleague recently claimed to me that LOCK_commit_ordered is a
> huge bottleneck. As you can see, the write_lock and flush_lock are

But is it really the LOCK_commit_ordered that is a bottleneck? Or is it just
this locks that appears in profiling reports because it is where the
binlog sequencing happens according to points (2), (3), and (4) above? I'm
very interested to learn ways of improving this, but I can easily imagine
that the LOCK_commit_ordered could be misunderstood as the bottleneck of
what is in reality a much deeper issue in how the binlog is operating in the
current implementation.

> It seems to me that the missing tracking of logical time in the data
> structures is causing hangs like MDEV-25611. I feel that the design is

This is a hang with RESET MASTER. I vaguely recall some very tricky locking
semantics around RESET MASTER. My guess would be that complex and
inconsistent locking around RESET MASTER is the cause, rather than the
commit_checkpoint_request() API which is completely asynchroneous and not
expected to cause hangs. But I'll need to look into the code and detail to
be sure.

> based on a wrong assumption that the binlog-covered transactions are
> the only source of InnoDB log writes. That has never been the case:

Hm, there should be no such assumption. As explained above, InnoDB is free
to write whatever and however it wants to its logs. The binlog just needs to
know at _some_ point in time that all the transactions in a specific binlog
file are now durably committed in the engine. That point in time can be
delayed for long (even hours), the only penalty is a slightly longer
recovery in case of crash.

Or am I missing something?

> The task MDEV-18959 (ensuring that one fdatasync() per durable
> transaction suffices) is also related to this. I think that the

Yes, solving the problem of multiple syncs being required is a long-long
time goal, this is a huge bottleneck.
At the root of this is having multiple transactional/redo logs (one in each
engine and one in binlog), as opposed to a single shared log.

> simplest solution (assuming that moving the binlog to a bunch of
> transactional tables is out of the question) is to ensure that one of

This would be one way to effectively use a single log - the updates to the
tables would happen through the InnoDB log. My concerns would include the
unnecessary overhead of duplicating writes (to both log and tablespaces) for
what is an append-only dataset, and how to make this work for multiple
storage engines and multi-engine transactions.

> the logs (binlog and other write-ahead logs) is always ahead. If we
> choose that the binlog must be ahead, then in InnoDB the write_lock

I've always had some concerns about making the binlog the "master" log on
which recovery is based.

One problem is that the current binlog implementation is very naive. It's
not even block-structured and pre-allocated, it's just a simple file being
constantly appended to. This is already double overhead on fdatasync() since
the filesystem must sync both data and metadata (file length).

Another problem is that the binlog is a logical log (SQL statements), and
fragile to base recovery on that _must_ be 100% reliable. If we don't want
to recover InnoDB transactions by replaying binlog replication event groups,
then we anyway need an fdatasync() inside InnoDB in the XA prepare step
before we can be sure the transaction is durable, don't we?

Another idea would be to go in the opposite direction, where the storage
engine could provide a logging service to the server (just like it provides
tables). This way the binlog code could call into InnoDB to write the binlog
records into the InnoDB log. But this requires changes on the InnoDB side
(log archiving for one), and reading binlogs for replication will be slower
as it's interleaved with unrelated log records.

> acquisition would have to be preceded by a binlog_write_lock
> acquisition that will ensure that any relevant changes of persistent
> state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part
> of the InnoDB log write must have been durably written to the binlog.

Yes, it is an interesting idea. What do you see as the mechanism to recover
a transaction inside InnoDB in case of a crash just after writing to the

Very interesting discussion and points Marko, thanks for your writeup, and
looking further to future discussions with you.

 - Kristian.

Follow ups