← Back to team overview

maria-developers team mailing list archive

Re: Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)


On Fri, May 12, 2023 at 8:32 PM Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
> The transaction of interest is the last one that called commit_ordered()
> prior to this call of commit_checkpoint_request(). I guess binlog code could
> save it (it's trivial as commit_ordered() calls are serialised), or InnoDB
> could do it itself.

I think that for InnoDB to guess the last binlog-committed transaction
would require adding a thread-local variable, which could cause
trouble with the connection thread pool. An added function parameter
would be much simpler.

> But let's take it from what we actually want to achive. We have created a
> new binlog file, and now crash recovery will have to scan the two most
> recent binlog files. We want to know when scanning the old binlog file is no
> longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
> But we don't want to stall waiting for everything to be flushed to disk
> immediately; we can just log the CHECKPOINT_EVENT later when appropriate.

Is this the original purpose and motivation of the

> There seems to be some lock-free operations, I wonder if that makes sense,
> this is not timing-critical code (it's called once per binlog rotation).

It was a long time ago when I worked on this. The InnoDB log_sys.mutex
certainly was a bottleneck before some InnoDB log related data
structures were refactored to use C++11 std::atomic.
I might have originally misunderstood the source code comment in
sql/handler.h that was added in MDEV-232.

The 10.4 innobase_mysql_log_notify() follows the anti-pattern

The comment that claims that it is safe might not hold for
architectures that use a weak memory model, such as typical
implementations of ARM, POWER, and RISC-V (RVWMO).
I believe that the
that is employed by the 10.5 innodb_log_flush_request() and
log_flush_notify() should be safer.

> There's also some complex logic to avoid missing a
> commit_checkpoint_notify_ha(). I can see how that is important and could
> occur in a completely idle server. But I think it can be done simpler, by
> simply scheduling a new check of this every second or so as long as there
> are pending checkpoint requests. This is all asynchronous, no point in
> trying to report the checkpoint immediately.

According to a comment in https://jira.mariadb.org/browse/MDEV-25611
there is an https://rr-project.org/ trace of a hang available to
Andrei Elkin. It is also pretty easy to reproduce the hang by using
some debug instrumentation that will make InnoDB to try harder to
avoid issuing log checkpoints (and therefore trigger another call to
log_flush_notify() which will therefore "rescue" the hang). It would
be very interesting to know what happened concurrently with or after
the last call to innodb_log_flush_request().

> If InnoDB is running non-durably, I think there is no point in checking the
> log at all. It doesn't help that we scan the required log files if the
> transactions to be recoved from InnoDB were not prepared durably and are
> unrecoverable anyway. So in non-durable mode InnoDB could just call
> commit_checkpoint_notify_ha() immediately?

I do not think that we should spend too much effort to optimize an
unsafe, corruption-prone mode of operation. We should be safe by
default (https://jira.mariadb.org/browse/MDEV-16589) and optimize that

> I notice that RESET MASTER also does a commit_checkpoint_request() and waits
> for the notification. I can understand why I did it this way, but in
> hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so
> we should not wait on it. RESET MASTER surely cannot be time critical, so
> probably it would be better to just ask InnoDB to flush everything to its
> log, skipping the checkpoint mechanism. Or just do nothing, does it make any
> sense to recover the binlog into a consistent state with InnoDB when we just
> deleted all of the binlog?!?

If RESET MASTER can rewind the binlog position, then it might make
sense to ensure that the new binlog position is durably written to the
persistent InnoDB transaction metadata storage.

> Anyway, these are comments on the current code. I still didn't fully
> understand what the problem is with the current API wrt. InnoDB?

Based on your description, it does not feel like a big design problem.
I would very much like to see an analysis and fix of the MDEV-25611
hang. It could be something fairly trivial, or a sign of a more
serious problem in the design.

> I have an old worklog where this idea was explored in some detail. It is not
> trivial, there are many corner cases where replaying binlog events will not
> reliably recover the state (eg. statement-based binlogging). People have
> learned to live with this for replication, but now that will be extended to
> the binlog. One "interesting" corner case is a multi-engine transaction that
> ended up durable in one engine but needs replay in another engine. Maybe all
> this is tolerable and we can require row-mode binlogging or something, the
> gain will be huge for some workloads.

Right, I did not think of multi-engine transactions. Perhaps there
could be a special single-engine transaction mode? If the binlog is
guaranteed to be durably written ahead of the engine log, then we
should be able to skip the MySQLXid mechanism as well. However, I do
not know how much that might end up saving.

> > store the latest committed binlog file name and offset. I do not know
> > where exactly that information is being used.
> (I think the information with binlog file name and offset is used to
> provision a new slave from an innobackup or LVM/BTRFS snapshot?).

Yes, something like that. See
https://jira.mariadb.org/browse/MDEV-22351 and

One more thing: The post
https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing
the fact that when the "fast part" of an InnoDB transaction state
change has finished, the LSN of the transaction COMMIT (or XA PREPARE
or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to
reorder transactions after that point. In fact, after the completion
of the "fast part", other transactions in the system will observe the
state change of the transaction, even before it becomes durable.

Another thing: Could replication make use of the log write completion
mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to
MariaDB Server 10.6.0?
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc

Follow ups