← Back to team overview

maria-developers team mailing list archive

Re: Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)


Marko Mäkelä <marko.makela@xxxxxxxxxxx> writes:

>> recent binlog files. We want to know when scanning the old binlog file is no
>> longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
>> But we don't want to stall waiting for everything to be flushed to disk
>> immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
> Is this the original purpose and motivation of the
> handlerton::commit_checkpoint_request?


> I think that for InnoDB to guess the last binlog-committed transaction
> would require adding a thread-local variable, which could cause
> trouble with the connection thread pool. An added function parameter
> would be much simpler.

I agree that a function parameter seems simpler. We're requesting to be
notified when a specific commit is durable in the engine, better specify
_which_ commit than for InnoDB to guess, as you say :-).

And an engine is free to ignore the parameter, and just make sure all
existing transactions at that point are committed (eg. current lsn), as my
original InnoDB implementation did.

We need to be careful about lifetime. The transaction may no longer exist as
a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the
corresponding LSN, as was suggested in another mail, and then
commit_checkpoint_request() could pass that value.

Something like that seems indeed a better API.

>> There seems to be some lock-free operations, I wonder if that makes sense,
>> this is not timing-critical code (it's called once per binlog rotation).
> It was a long time ago when I worked on this. The InnoDB log_sys.mutex
> certainly was a bottleneck before some InnoDB log related data
> structures were refactored to use C++11 std::atomic.

I was thinking that the log_flush_request list does not need anything fancy,
and could just use a normal mutex, it is so rarely accessed. And any more
critical code (eg. using log_sys.mutex) could just do an unsafe check for
NULL log_flush_request list, at the worst skipping a binlog checkpoint
notification until the next flush.

But all that becomes more complicated when RESET MASTER will hang until the
checkpoint notification arrives. Then a lost checkpoint notification becomes
a problem on an idle server. That's why I'm unhappy about the way RESET
MASTER is done now (even though I implemented it, IIRC). It really was my
intention that skipping a binlog checkpoint from time to time should be
allowed to simplify things in the engine.

> The 10.4 innobase_mysql_log_notify() follows the anti-pattern
> https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#double-checked-locking
> The comment that claims that it is safe might not hold for
> architectures that use a weak memory model, such as typical
> implementations of ARM, POWER, and RISC-V (RVWMO).
> I believe that the
> https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering
> that is employed by the 10.5 innodb_log_flush_request() and
> log_flush_notify() should be safer.

Right. As long as RESET MASTER doesn't tolerate delayed notifications, all
of this becomes a concern :-(

> If the last write to the log was an InnoDB internal one, it could be
> done by log_write_up_to(lsn, false).
> The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere
> closer to log_sys.lsn) if there was a subsequent call to
> log_write_up_to(lsn, true).

What I found in the test case I added to MDEV-25611 is that InnoDB
flushes the redo log every second (assuming I read the code
right). This is the function srv_sync_log_buffer_in_background(). Hm, but
actually this can be controlled by the option srv_flush_log_at_timeout, so
my testcase could be improved to set this to a high value instead of
disabling srv_sync_log_buffer_in_background() in the code.

> Right, I did not think of multi-engine transactions. Perhaps there
> could be a special single-engine transaction mode? If the binlog is

Yes. single-engine transaction is surely the important usecase to optimise.
It's nice if multi-engine transactions still work, but if they require
multiple fsync still, I think that's perfectly fine, not something to
allocate a lot of resources to optimise for.

> One more thing: The post
> https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing
> the fact that when the "fast part" of an InnoDB transaction state
> change has finished, the LSN of the transaction COMMIT (or XA PREPARE
> or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to
> reorder transactions after that point. In fact, after the completion
> of the "fast part", other transactions in the system will observe the
> state change of the transaction, even before it becomes durable.

Thanks. Yes, this sounds like what I would expect.

> Another thing: Could replication make use of the log write completion
> mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to
> MariaDB Server 10.6.0?

Ah, this is interesting, thanks for the pointer.
I will need to look into it more to fully understand, but it does seem like
something that might be of use. Not just for user threads writing to the
binlog, but maybe even more so for parallel replication worker threads.
Because of the wait_for_prior_commit() requirements for in-order parallel
replication, a _lot_ of THDs can be waiting for binlog commit, and being
able to suspend these THDs and let a worker thread do something else
in-between could potentially be very beneficial, I think.

I also had the idea to use fibers/coroutines as is mentined in the MDEV
description, but if that can be avoided, so much the better.

 - Kristian.

Follow ups