maria-developers team mailing list archive

Re: A problem with implementing Group Commit with Binlog with MyRocks

To: andrei.elkin@xxxxxxxxxx
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 11 Sep 2017 21:10:27 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, andrei.elkin@xxxxxxxxxxx, MyRocks - RocksDB storage engine for MySQL <myrocks-dev@xxxxxxxxxxxxxxxx>
In-reply-to: <8760cpozpb.fsf@quad> (andrei elkin's message of "Mon, 11 Sep 2017 21:30:24 +0300")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)

andrei.elkin@xxxxxxxxxx writes:

>> 2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly
>> synchronise snapshots between multiple storage engines (MySQL does not have
>> this, I think).
>
> (Offtopic, but anyway what it is? Multi-engine transaction with this
> specific ISOLATION level?)

  https://mariadb.com/kb/en/the-mariadb-library/enhancements-for-start-transaction-with-consistent-snapshot/

So it is like REPEATABLE READ across engines: applications can get a
consistent view of cross-engine transactions.
It also makes it possible to run a non-blocking mysqldump without FLUSH
TABLES WITH READ LOCK.

>> 3. To avoid having to do an extra fsync() for every commit, on top of the
>> one for prepare and the one for binlog write (MySQL has this).

>> MySQL handles (3) by stopping all transactions around binlog rotate and
>> doing a flush_logs().
>
> Maybe I am missing something, but why binlog rotation? It is not a
> common case.  Indeed MySQL BGC reduces the number of fsync() to two by
> the (flush stage) leader.  As to rotation, it's a specific branch of
>  MYSQL_BIN_LOG::ordered_commit()
> where a rotator thread contends for the flush stage mutex, eventually gains it
> (which may lead to few more groups binlogged into the old being rotated
> file) and performs.

It used to be that there were _three_ fsyncs for every commit (prepare,
binlog write, and commit). The _only_ reason the fsync in commit was needed
was to ensure that binlog crash recovery would still work after a binlog
rotation. Which was kind of silly.

So _something_ needed to be done around binlog rotation, to ensure that all
transactions are durably committed in storage engines before they are no
longer available to binlog crash recovery.

If I understand correctly, MySQL ensures this by temporarily stopping binlog
writes, calling flush_logs() in all (?) engines (with semantics that
flush_logs() must make all prior commit()'s durable), and only then allowing
new writes to the new binlog. I am not sure how MySQL ensures that all
commit() calls complete before the flush_logs() call; maybe it takes both
the LOCK_commit and LOCK_log mutexes around binlog rotation. The result is
that binlog crash recovery is always possible from only one binlog file.
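
The rotation scheme described above can be sketched as a toy model in
Python. This is only an illustration of the idea, not MySQL code: ToyEngine,
ToyBinlog and the single lock are hypothetical names, and the one lock
merely stands in for whatever mutexes the server really takes. The only
assumption carried over from the text is that flush_logs() makes all prior
commit() calls durable.

```python
import threading


class ToyEngine:
    """Hypothetical storage engine: commits are in memory until flushed."""

    def __init__(self):
        self.committed = []  # committed, but not yet durable
        self.durable = []    # would survive a crash

    def commit(self, xid):
        self.committed.append(xid)

    def flush_logs(self):
        # Assumed semantics: make every prior commit() durable.
        self.durable.extend(self.committed)
        self.committed.clear()


class ToyBinlog:
    """Hypothetical binlog; each file is just a list of transaction ids."""

    def __init__(self, engines):
        self.lock = threading.Lock()  # stand-in for the real server mutexes
        self.engines = engines
        self.files = [[]]

    def write_and_commit(self, xid):
        with self.lock:
            self.files[-1].append(xid)  # binlog write
            for engine in self.engines:
                engine.commit(xid)      # engine commit, no fsync here

    def rotate(self):
        # Writers are blocked while we hold the lock, so every commit()
        # has completed before flush_logs() runs.
        with self.lock:
            for engine in self.engines:
                engine.flush_logs()
            self.files.append([])       # new writes go to the new file


engine = ToyEngine()
binlog = ToyBinlog([engine])
binlog.write_and_commit(1)
binlog.write_and_commit(2)
binlog.rotate()
binlog.write_and_commit(3)
```

After the rotate() call, everything in the old file is durable in the
engine, so crash recovery in this model only ever needs the newest file.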

MariaDB instead extends binlog crash recovery to consider multiple binlog
files, if necessary. Then nothing special is needed during binlog rotation.
But some "garbage collection" is needed to eventually release old binlog
files.
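
As a rough illustration of the multi-file idea, here is a toy Python
sketch. It is not the MariaDB implementation: BinlogFile, recover() and
purgeable() are hypothetical names, and transactions are reduced to bare
ids. It assumes recovery commits a prepared transaction if its id is found
in any retained binlog file and rolls it back otherwise, and that an old
file can be released once everything in it is durably committed in the
engine.

```python
from dataclasses import dataclass, field


@dataclass
class BinlogFile:
    """Hypothetical binlog file: a name plus the xids written to it."""
    name: str
    xids: list = field(default_factory=list)


def recover(binlogs, prepared_in_engine):
    """Decide the fate of prepared transactions after a crash.

    binlogs: retained binlog files, oldest first.
    Returns (to_commit, to_rollback) as sorted lists of xids.
    """
    logged = set()
    for f in binlogs:  # scan all retained files, not just the last
        logged.update(f.xids)
    to_commit = sorted(x for x in prepared_in_engine if x in logged)
    to_rollback = sorted(x for x in prepared_in_engine if x not in logged)
    return to_commit, to_rollback


def purgeable(binlogs, durable_in_engine):
    """The "garbage collection" step: names of old files safe to release.

    The newest file is still being written, so it is never released.
    """
    released = []
    for f in binlogs[:-1]:
        if all(x in durable_in_engine for x in f.xids):
            released.append(f.name)
        else:
            break  # keep this file and everything after it
    return released


binlogs = [
    BinlogFile("binlog.000001", [1, 2]),
    BinlogFile("binlog.000002", [3]),
]
# xid 2 was binlogged before the rotation but is still only prepared.
print(recover(binlogs, {2, 3, 4}))  # ([2, 3], [4])
print(purgeable(binlogs, {1}))      # []
print(purgeable(binlogs, {1, 2}))   # ['binlog.000001']
```

In this model nothing has to stall at rotation time; the cost moves to
keeping old files around until purgeable() says they can go.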

>> MariaDB avoids this stall
>
> You must be having a use case which I can't see..

I am not sure one is better than the other. MariaDB avoids flush_logs(),
though in current storage engines it may not matter much. The MySQL approach
is arguably simpler code, though it seems quite an abuse of flush_logs().
The MySQL approach was not public when the MariaDB approach was implemented.

 - Kristian.
