maria-developers team mailing list archive

Re: A problem with implementing Group Commit with Binlog with MyRocks

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: andrei.elkin@xxxxxxxxxx
Date: Tue, 12 Sep 2017 16:40:34 +0300
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, andrei.elkin@xxxxxxxxxxx, MyRocks - RocksDB storage engine for MySQL <myrocks-dev@xxxxxxxxxxxxxxxx>
In-reply-to: <87ingpgifw.fsf@urd.knielsen-hq.org> (Kristian Nielsen's message of "Mon, 11 Sep 2017 21:10:27 +0200")
Organization: Home sweet home
Reply-to: andrei.elkin@xxxxxxxxxx
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Kristian,

> andrei.elkin@xxxxxxxxxx writes:
>
>>> 2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly
>>> synchronise snapshots between multiple storage engines (MySQL does not have
>>> this, I think).
>>
>> (Offtopic, but anyway what it is? Multi-engine transaction with this
>> specific ISOLATION level?)
>
>   https://mariadb.com/kb/en/the-mariadb-library/enhancements-for-start-transaction-with-consistent-snapshot/
>
> So it is like a REPEATABLE READ across engines, applications can get a
> consistent view of cross-engine transactions.
> It also allows to do a non-blocking mysqldump without FLUSH TABLES WITH READ
> LOCK.
>
>>> 3. To avoid having to do an extra fsync() for every commit, on top of the
>>> one for prepare and the one for binlog write (MySQL has this).
>
>>> MySQL handles (3) by stopping all transactions around binlog rotate and
>>> doing a flush_logs().
>>
>> Maybe I am missing something, but why binlog rotation? It is not a
>> common case.  Indeed MySQL BGC reduces the number of fsync() to two by
>> the (flush stage) leader.  As to rotation, it's a specific branch of
>>  MYSQL_BIN_LOG::ordered_commit()
>> where a rotator thread contends for the flush stage mutex, eventually gains it
>> (which may lead to few more groups binlogged into the old being rotated
>> file) and performs.
>
> It used to be that there was _three_ fsyncs for every commit. The _only_
> reason the fsync in commit was needed was to ensure that binlog crash
> recovery would still work after a binlog rotation. Which was kind of
> silly.

This one must be

https://github.com/mysql/mysql-server/commit/35adf21bb63a336c76efdad6c4610161f3fd733d

>
> So _something_ needed to be done around binlog rotation. To ensure that all
> transactions are durably committed in storage engines before they are no
> longer available to binlog crash recovery.
>
> If I understand correctly, MySQL ensures this by temporarily stopping binlog
> writes, calling flush_logs() in all (?) engines (with semantics that
> flush_logs() must make all prior commit()'s durable), and only then allowing
> new writes to the new binlog. I am not sure how MySQL ensures that all
> commit() calls complete before the flush_logs() call, maybe it takes both
> the LOCK_commit and LOCK_log mutexes around binlog rotation.

Almost: LOCK_log and LOCK_xids. I've checked it, because I had really forgotten.
MySQL employs a variant of the former (pre-BGC) xid "unlogging" technique.
The rotator first waits (on a pthread condition variable) until all xids
flushed to the binlog have been committed, and then, holding both mutexes,
issues ha_flush_logs() right before the new log file is set.
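
To make the ordering concrete, here is a rough C++ model of that rotation
sequence. It is only a sketch under my own assumptions, not MySQL source: the
struct, its members and the event trace are hypothetical, and only the names
LOCK_log, LOCK_xids and ha_flush_logs() come from the discussion above.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical model of the rotation protocol; not taken from MySQL source.
struct BinlogModel {
  std::mutex lock_log;                 // stands in for LOCK_log
  std::mutex lock_xids;                // stands in for LOCK_xids
  std::condition_variable xids_cond;   // signalled when the xid count hits zero
  int prepared_xids = 0;               // xids in the binlog, engine commit pending
  int current_file = 1;                // index of the active binlog file
  std::vector<std::string> events;     // ordered trace, guarded by lock_xids

  // Committer: the transaction is written to the binlog, so count its xid.
  void write_to_binlog() {
    std::lock_guard<std::mutex> log_guard(lock_log);
    std::lock_guard<std::mutex> xid_guard(lock_xids);
    ++prepared_xids;
    events.push_back("write");
  }

  // Committer: the engine commit() has returned, so "unlog" the xid.
  void engine_commit_done() {
    std::lock_guard<std::mutex> xid_guard(lock_xids);
    assert(prepared_xids > 0);
    events.push_back("commit");
    if (--prepared_xids == 0) xids_cond.notify_all();
  }

  // Rotator: block new binlog writes, wait for pending xids to commit,
  // then flush engine logs and only afterwards switch to the new file.
  void rotate() {
    std::lock_guard<std::mutex> log_guard(lock_log);
    std::unique_lock<std::mutex> xid_guard(lock_xids);
    // wait() releases lock_xids while sleeping, so committers can finish.
    xids_cond.wait(xid_guard, [this] { return prepared_xids == 0; });
    events.push_back("flush_logs");    // where ha_flush_logs() would be called
    ++current_file;
    events.push_back("rotate");
  }
};
```

In this model a thread calling rotate() cannot record "flush_logs" until every
xid counted by write_to_binlog() has been released by engine_commit_done(), and
no new write can slip in meanwhile because lock_log is held throughout.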

> The result is
> that binlog crash recovery is always possible from only one binlog file.
>
> MariaDB instead extends binlog crash recovery to consider multiple binlog
> files, if necessary. Then nothing special is needed during binlog rotation.
> But some "garbage collection" is needed to eventually release old binlog
> files.
>
>>> MariaDB avoids this stall
>>
>> You must be having a use case which I can't see..
>
> I am not sure one is better than the other. MariaDB avoids flush_logs(),
> though in current storage engines it may not matter much. The MySQL approach
> is arguably simpler code, though it seems quite an abuse of flush_logs().

It can only be good to have two well-explored methods around.

> The MySQL approach was not public when the MariaDB approach was
> implemented.

True.

Thanks a lot for talking and explaining these fine bits!

Andrei

References

A problem with implementing Group Commit with Binlog with MyRocks
From: Sergey Petrunia, 2017-09-08
Re: A problem with implementing Group Commit with Binlog with MyRocks
From: Kristian Nielsen, 2017-09-11
Re: A problem with implementing Group Commit with Binlog with MyRocks
From: andrei . elkin, 2017-09-11
Re: A problem with implementing Group Commit with Binlog with MyRocks
From: Kristian Nielsen, 2017-09-11