maria-developers team mailing list archive

commit_checkpoint_request() vs. thd_get_durability_property() (in relation to MDEV-11937)

 

Monty asked me to fix MDEV-11937. This particular one is a performance
regression in InnoDB commit. But there is a wider problem that I thought I
should explain, so that it can perhaps be avoided in the future.

MariaDB and MySQL use different mechanisms for storage engines to avoid
having to fsync during commit when binlog is enabled. In MariaDB, storage
engines implement the commit_checkpoint_request() handlerton method. In
MySQL, storage engines call thd_get_durability_property() to check if they
can avoid fsync.

The problem is that somehow the thd_get_durability_property() function was
introduced into MariaDB code, but it is completely non-functional. So now
there is code in InnoDB, TokuDB and RocksDB that calls this function and
does not work correctly. This led to a performance regression due to extra
fsync() calls.

This seems to me a serious problem. New code can now be merged and compile
fine while in reality being wrong. There really should not be two separate
and different mechanisms for the same thing, and certainly not with one of
them non-functional.

The "expected" approach would be to remove thd_get_durability_property() and
update storage engines to use the corresponding MariaDB APIs
(commit_ordered() and commit_checkpoint_request()). This should not be hard.
A simple commit_checkpoint_request() implementation can just fsync all
transactions immediately (similar to what MySQL does). A more elaborate
implementation can avoid any extra fsyncs, and just asynchronously notify
the upper layer with commit_checkpoint_notify_ha() later when such an fsync
happens normally (this is what InnoDB does). See comments in handler.h for
details.
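
To make the two strategies concrete, here is a rough toy model. It is not
the real handlerton interface: ToyEngine, its LSN counters and the two
checkpoint_request_*() methods are all made up for illustration, and the
callback merely stands in for the notification to the upper layer.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Toy model only. None of these names exist in MariaDB.
struct ToyEngine {
    uint64_t written_lsn = 0;  // highest log position written so far
    uint64_t flushed_lsn = 0;  // highest log position known to be durable
    int fsync_count = 0;       // how many (simulated) fsyncs were done
    // Outstanding checkpoint requests: (LSN to reach, callback to run).
    std::vector<std::pair<uint64_t, std::function<void()>>> pending;

    // Commit without forcing the log to disk.
    void commit() { ++written_lsn; }

    // Stands in for a real fsync of the engine log.
    void flush_log() {
        ++fsync_count;
        flushed_lsn = written_lsn;
        assert(flushed_lsn <= written_lsn);
        // Run the callback of every request that is now durable.
        std::vector<std::pair<uint64_t, std::function<void()>>> waiting;
        waiting.swap(pending);
        for (auto &req : waiting) {
            if (req.first <= flushed_lsn)
                req.second();
            else
                pending.push_back(std::move(req));
        }
    }

    // Simple strategy: pay for an fsync right away, then notify.
    void checkpoint_request_simple(std::function<void()> notify) {
        flush_log();
        notify();
    }

    // Lazy strategy: no extra fsync. Remember the request and notify
    // when the log gets flushed anyway.
    void checkpoint_request_lazy(std::function<void()> notify) {
        if (flushed_lsn >= written_lsn) {
            notify();  // everything committed so far is already durable
            return;
        }
        pending.emplace_back(written_lsn, std::move(notify));
    }
};
```

With the lazy variant, a request made after two commits costs no fsync of
its own; the callback only runs at the next flush_log().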

Or was the intention to eventually replace the whole MariaDB binlog group
commit implementation with the MySQL one, to make MariaDB less divergent?
This would require a number of changes to MariaDB binlog and replication.

The binlog recovery code should be replaced (MySQL does not have the ability
to recover from more than one binlog). The binlog group commit code must be
replaced as well, and the commit_ordered() mechanism removed. I think this
would also require a re-design of the MariaDB in-order parallel replication.
MySQL has some optional mechanism for in-order, but my understanding is that
it is not sufficient to do optimistic parallel replication. I am not
intimately familiar with the MySQL code though.

I hope this helps. The wider problem behind MDEV-11937 is one of policy more
than one of code bugs, so not too much else I can do to address it.

-----------------------------------------------------------------------

Incidentally, I noticed some code in InnoDB trx0trx.cc:

    /* We set the HA_IGNORE_DURABILITY during prepare phase of
    binlog group commit to not flush redo log for every transaction
    here. So that we can flush prepared records of transactions to
    redo log in a group right before writing them to binary log
    during flush stage of binlog group commit. */

So the idea is to do group prepare with the same group of transactions that
will later group commit to the binlog. In MariaDB, this concept does not
exist. Storage engine prepares are allowed to run in parallel and in any
order compared to binlog commit. So the InnoDB group prepare can include
more transactions than participate in the binlog commit (but also fewer, of
course).

IIRC, the important thing is to ensure that all transactions are durably
prepared in storage engines before being written to the binlog. In MariaDB,
there is MYSQL_BIN_LOG::group_commit_queue that holds the list of
transactions to be group committed to the binlog.

A similar mechanism in MariaDB might use the group_commit_queue as the set
of transactions to send to storage engines for group prepare. But if we wait
for a prepare fsync() after building this list, some transactions that could
prepare during this fsync may be unnecessarily delayed to the next binlog
commit.
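
As a rough sketch of what such a mechanism could look like (a toy model,
not MariaDB code: ToyTrx, ToyRedoLog and group_commit() are invented names,
and a plain vector stands in for both the queue and the binlog):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only. None of these names exist in MariaDB.
struct ToyTrx {
    uint64_t prepare_lsn;  // log position of this transaction's prepare
};

struct ToyRedoLog {
    uint64_t flushed_lsn = 0;  // highest durable log position
    int fsync_count = 0;       // how many (simulated) fsyncs were done

    void flush_up_to(uint64_t lsn) {
        if (lsn <= flushed_lsn)
            return;            // already durable, no fsync needed
        ++fsync_count;
        flushed_lsn = lsn;
    }
};

// Take the queued transactions, make all their prepares durable with at
// most one fsync, and only then append them to the (toy) binlog.
std::size_t group_commit(std::vector<ToyTrx> &queue, ToyRedoLog &log,
                         std::vector<uint64_t> &binlog) {
    std::vector<ToyTrx> group;
    group.swap(queue);         // later arrivals wait for the next group

    uint64_t max_lsn = 0;
    for (const ToyTrx &t : group)
        max_lsn = std::max(max_lsn, t.prepare_lsn);

    log.flush_up_to(max_lsn);  // the group prepare
    assert(log.flushed_lsn >= max_lsn);

    for (const ToyTrx &t : group)
        binlog.push_back(t.prepare_lsn);
    return group.size();
}
```

The swap at the top is where the delay mentioned above comes from: a
transaction that finishes its prepare while flush_up_to() is running is not
in the group and has to wait for the next round.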

The MySQL 5.7 code grabs the list of transactions to binlog group commit,
and then flushes the log in _all_ storage engines, unconditionally. That
just seems horribly wrong, but I can't see any other way to read the code
(from MYSQL_BIN_LOG::process_flush_stage_queue()):

  ha_flush_logs(NULL, true);

It even does so while holding LOCK_log :-( I guess the MySQL idea is that
there is only one storage engine anyway, InnoDB.

 - Kristian.

