← Back to team overview

maria-developers team mailing list archive

Re: A problem with implementing Group Commit with Binlog with MyRocks


Kristian, Sergey, hello.

While I'm still studying the Mariadb BGC (binlog group commit) let me
jump in with few comments.

> Sergey Petrunia <sergey@xxxxxxxxxxx> writes:
>> == Some background ==
>> "group commit with binlog" feature needs to accomplish two goals:
>> 1. Keep the binlog and the storage engine in sync.
>>   storage_engine->prepare(sync=true);
>>   binlog->write(sync=true);
>>   storage_engine->commit();
>> 2. The second goal is to make operation performant.  We need two coordinated disk 

>> == Group Commit with Binlog in MySQL ==
>> MySQL (and fb/mysql-5.6 in particular) does in the following phases:
>> Phase #1:
>>   Call storage_engine->prepare() for all transactions in the group.
>>   The call itself is not persistent.
>> Phase #2: Call storage->engine->flush_logs(). 
>>   This makes the effect of all Prepare operations from Phase#1 persistent.
>> Phase #3:
>>   Write and sync the binary log.
>> Phase #4: 
>>   Call storage_engine->commit(). This does not need to be persistent.
> Interesting. Phase #2 is a mysql 5.7 feature, it is not in 5.6. Did Facebook
> backport this to their 5.6 tree? Or did MySQL 5.7 get this from Facebook
> work?

Facebook contributed, indeed.

Bug #73202 	write/sync redo log before flush thread cache to binlog
Submitted: 	5 Jul 2014 0:31 	Modified: 	24 Nov 2014 2:01
Reporter: 	zhai weixiang (OCA) 

>> MariaDB does not have these phases described above:
>>>  Phase #1:
>>>    Call storage_engine->prepare() for all transactions in the group.
>>>    The call itself is not persistent.
>>>  Phase #2: Call storage->engine->flush_logs(). 
>>>    This makes the effect of all Prepare operations from Phase#1 persistent.
> Right, it combines them in a single "phase". storage_engine->prepare() is
> expected to be persistent and do its own group prepare. While MySQL 5.7
> builds a list of transactions to group prepare.
>> RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc,
>> DBImpl::WriteImpl function) handles both Prepare() and Commit() commands 
>> and does the following:
>> 1. Controls writing the commited data into the MemTable
>> 2. Writes transactions to WAL
>> 3. Syncs the WAL.
> Can you explain, at a high level, how RocksDB transaction visibility, lock
> release, and persistency works?
> Is it like - once a transaction is written to the MemTable, it is visible to
> other transactions and its commit order is determined wrt. other
> transactions?
> And persistency is guaranteed after write+sync of the WAL?
> When are locks released in this sequence?
>> All three steps are done for the whole group. This has a consequence: a
>> Commit() operation that does not need to sync the WAL will still be delayed
>> if another operation in the group needs the WAL to be synced.
> So do I understand correctly: DbImpl::WriteImpl() does both group
> commit

Sergey definitely needed to spell it out. One may even think of
a SELECT that suffers delay.. A fair guess?

> (in-memory) to the MemTable, as well as group commit (on disk) to the WAL?
> And it uses the _same_ grouping of transactions for these two operations?
> And so, the first commit_ordered() joins some prepare() that wants to sync
> the WAL. And only once that sync is done can the next commit_ordered() start
> - and it might easily end up joining a thread that recently completed
> its delayed commit_ordered() and is now doing prepare for a new transaction.
> Indeed, this is not acceptable performance-wise for commit_ordered(). It
> must not wait for disk operations.
> So why is this not a problem in the MySQL case? MySQL runs the
> handlerton->commit() calls under LOCK_commit just like MariaDB does
> commit_ordered().

Sounds like 5.7 (5.6-fb) flush_logs() by the BGC leader triggers
necessary group-prepare coordination. Specifically, I would expect (yet
to look at the Sergey's branch) one fsync() for the whole group by the

> My guess is that this is because in your tests, WriteImpl() was _always_
> called with WAL sync disabled. I wonder what would happen if you were to run
> a mix of binlogged and non-binlogged (SET sql_log_bin=0) transactions, where
> the latter would end up in WriteImpl(sync=true); maybe a similar problem
> would occur.
> So I think there is something interesting to look at here. If I understand
> correctly, WriteImpl() tries to reduce contention between threads doing
> commit (even in-memory commit) by making them group up and having a single
> thread commit for multiple transactions, rather than jumping from one thread
> to another for each commit.
> This is a valid technique, but it seems to fit badly with what MySQL and
> MariaDB is doing. In MariaDB, commits are _already_ grouped and done from a
> single thread. In MySQL, _both_ prepare and commits are so grouped from a
> single thread (though I think one thread can do group prepare in parallel
> with another doing group commit).
> So there seems to be an opportunity to simplify WriteImpl() for the MySQL
> and MariaDB binlog case. If my understanding is correct, there will not be a
> lot of writer grouping.
> (Of course for the non-binlog case, the situation is different).
> Maybe this comment is relevant here?
>     // Requesting sync with concurrent_prepare_ is expected to be very rare. We
>     // hance provide a simple implementation that is not necessarily efficient.
>> == Possible solutions ==
>> I am not sure what to do.
>> - Make the SQL layer's Group Commit implementation invoke hton->flush_logs()
>>   explicitly, like MySQL does?
> That is an option, though I hope you will not do it like that. What does
> flush_logs() have to do with making prepare() persistent?
> You could instead add a new handlerton method group_prepare() or something.
> If non-NULL, the storage engine may omit persistency in prepare(), but must
> then in group_prepare() ensure that all prepares are persistent that have
> completed prepare_ordered(). And if the method is non-null, the SQL layer
> will call group_prepare() just before binlog write (under a new mutex
> LOCK_group_prepare that is chained before LOCK_log.
> This way, the extra lock can be avoided for storage engines that do not need
> group_prepare(). And storage engines have freedom to implement
> group_prepare() in a way that suites them.
> Of course, Rocksdb can just implement group_prepare() as flush_logs() to
> make _all_ prepares persistent, just like in MySQL. So for rocksdb the
> functionality is identical to in MySQL, while flexibility is preserved for
> other storage engines.


> However, it still seems to me that there is opportunity to do better here.
> For example, the upper layer could present to Rocksdb the actual list of
> transactions that need to be group prepared / group committed. Then Rocksdb
> could do them in a single writer without having to coordinate the threads
> manually in WriteImpl().
>> - Modify RocksDB so that Transaction::Commit(sync=false) do not use Group
>>   Write? I am not sure if this is possible: Group Write is not about only 
>>   performance, it's about preventing concurrent MemTable writes.
>>   AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not
>>   participate in write groups and work as if there was no other activity.
> What about if rocksdb got the list of transactions to commit
> (to memtable, sync=false) explicitly, rather than as individual commit() or
> commit_ordered() calls? Then it could commit them all in a single writer,
> which should be more efficient. And similar for prepare?
> In the current MySQL (facebook patch) code, isn't it the case that each
> commit() has to create a new writer and write to memtable a single commit
> individually? While all of these calls are in fact from a single thread from
> an explicit, known list. This does not seem optimal.
>> - Modify RocksDB so that Transaction::Commit(sync=false) does not wait until
>>   its write group finishes WAL sync? This could be doable but is potentially
>>   complex.
> That should probably be done so that a write group would only write to the
> memtable (and the in-memory WAL buffer?). After that, it would release all
> non-syncing participants, and the remaining syncing participants could form
> a new write group to do the sync independently.
> Of course, if a write group rarely syncs, this is of little benefit. From my
> limited understanding of the code, flush_logs() which ends up in SyncWAL()
> does not use a write group.
> So this already ended up as a huge email, but I thought some background on
> commit_ordered() could also help here. Note that in MySQL >=5.6, their
> commit() is very similar to commit_ordered().
> commit_ordered exists for three main purposes, if I recall correctly:
> 1. To synchronise commit order in binlog and storage engine. This ensures
> that if a physical backup is taken of the storage engine and used to
> provision a slave, the storage engine state corresponds to a unique point
> in the binlog (MySQL has this).
> synchronise snapshots between multiple storage engines (MySQL does not have
> this, I think).

(Offtopic, but anyway what it is? Multi-engine transaction with this
specific ISOLATION level?)

> 3. To avoid having to do an extra fsync() for every commit, on top of the
> one for prepare and the one for binlog write (MySQL has this).
> I think those are the main reason for commit_ordered() (I might have
> forgotten some).
> For this problem, I suppose (3) is the main interest?
> MySQL handles (3) by stopping all transactions around binlog rotate and
> doing a flush_logs().

Maybe I am missing something, but why binlog rotation? It is not a
common case.  Indeed MySQL BGC reduces the number of fsync() to two by
the (flush stage) leader.  As to rotation, it's a specific branch of
where a rotator thread contends for the flush stage mutex, eventually gains it
(which may lead to few more groups binlogged into the old being rotated
file) and performs.

> It needs this because after binlog rotation, binlog
> crash recovery has only an empty binlog, so _all_ transactions must be
> durably committed at this point.
> MariaDB avoids this stall

You must be having a use case which I can't see..

> around binlog rotate. Instead it extends binlog
> crash recovery to be able to look into multiple binlog files. So there is no
> need to force commits to disk around binlog rotate.
> To eventually be able to drop binlog files, there is the
> binlog_checkpoint_request() mechanism. This allows the storage engine to
> inform the upper layer when all the transactions in a binlog have ended up
> durably committed, in the normal course of action of the storage engine.
> So to just get (3), RocksDB could just implement no commit_ordered(), or
> perhaps an empty commit_ordered(). And then also implement
> binlog_checkpoint_request() to record the latest prepared transaction at
> that point - and when the WAL is later synced, reply back to the upper layer
> to allow it to release the old binlog file. This seems doable without
> support for quickly committing a transaction to memory, which current
> RocksDB WriteImpl() seems poor at doing simultaneously with persistent
> prepare().
> Though I think it would be good if the full functionality of
> commit_ordered() was implemented in RocksDB. Passing down into RocksDB
> explicitly the list of transactions to group-prepare or group-commit sounds
> like an interesting idea that could potentially benefit performance.
> Thoughts?

Let me reserve my night for more :-).

> Hope this helps, I wanted to present some background on this feature. Please
> let me know of any details you want explained or discussed, and I will try
> to answer them briefly and to the point.
>  - Kristian.



Follow ups