maria-developers team mailing list archive

Thread
Date
Re: A problem with implementing Group Commit with Binlog with MyRocks

To: Sergey Petrunia <sergey@xxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 11 Sep 2017 16:29:43 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, andrei.elkin@xxxxxxxxxxx, MyRocks - RocksDB storage engine for MySQL <myrocks-dev@xxxxxxxxxxxxxxxx>
In-reply-to: <20170908150508.GB2264@pslp2> (Sergey Petrunia's message of "Fri, 8 Sep 2017 18:05:08 +0300")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
Sergey Petrunia <sergey@xxxxxxxxxxx> writes:

> == Some background ==
>
> "group commit with binlog" feature needs to accomplish two goals:
>
> 1. Keep the binlog and the storage engine in sync.
>   storage_engine->prepare(sync=true);
>   binlog->write(sync=true);
>   storage_engine->commit();
>
> 2. The second goal is to make operation performant.  We need two coordinated disk 

> == Group Commit with Binlog in MySQL ==
>
> MySQL (and fb/mysql-5.6 in particular) does in the following phases:
>
> Phase #1:
>   Call storage_engine->prepare() for all transactions in the group.
>   The call itself is not persistent.
>
> Phase #2: Call storage->engine->flush_logs(). 
>   This makes the effect of all Prepare operations from Phase#1 persistent.
>
> Phase #3:
>   Write and sync the binary log.
>
> Phase #4: 
>   Call storage_engine->commit(). This does not need to be persistent.

Interesting. Phase #2 is a mysql 5.7 feature, it is not in 5.6. Did Facebook
backport this to their 5.6 tree? Or did MySQL 5.7 get this from Facebook
work?

> MariaDB does not have these phases described above:
>
>>  Phase #1:
>>    Call storage_engine->prepare() for all transactions in the group.
>>    The call itself is not persistent.
>>
>>  Phase #2: Call storage->engine->flush_logs(). 
>>    This makes the effect of all Prepare operations from Phase#1 persistent.

Right, it combines them in a single "phase". storage_engine->prepare() is
expected to be persistent and do its own group prepare. While MySQL 5.7
builds a list of transactions to group prepare.

> RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc,
> DBImpl::WriteImpl function) handles both Prepare() and Commit() commands 
> and does the following:
>
> 1. Controls writing the commited data into the MemTable
> 2. Writes transactions to WAL
> 3. Syncs the WAL.

Can you explain, at a high level, how RocksDB transaction visibility, lock
release, and persistency works?

Is it like - once a transaction is written to the MemTable, it is visible to
other transactions and its commit order is determined wrt. other
transactions?

And persistency is guaranteed after write+sync of the WAL?

When are locks released in this sequence?

> All three steps are done for the whole group. This has a consequence: a
> Commit() operation that does not need to sync the WAL will still be delayed
> if another operation in the group needs the WAL to be synced.

So do I understand correctly: DbImpl::WriteImpl() does both group commit
(in-memory) to the MemTable, as well as group commit (on disk) to the WAL?
And it uses the _same_ grouping of transactions for these two operations?

And so, the first commit_ordered() joins some prepare() that wants to sync
the WAL. And only once that sync is done can the next commit_ordered() start
- and it might easily end up joining a thread that recently completed
its delayed commit_ordered() and is now doing prepare for a new transaction.

Indeed, this is not acceptable performance-wise for commit_ordered(). It
must not wait for disk operations.

So why is this not a problem in the MySQL case? MySQL runs the
handlerton->commit() calls under LOCK_commit just like MariaDB does
commit_ordered().

My guess is that this is because in your tests, WriteImpl() was _always_
called with WAL sync disabled. I wonder what would happen if you were to run
a mix of binlogged and non-binlogged (SET sql_log_bin=0) transactions, where
the latter would end up in WriteImpl(sync=true); maybe a similar problem
would occur.


So I think there is something interesting to look at here. If I understand
correctly, WriteImpl() tries to reduce contention between threads doing
commit (even in-memory commit) by making them group up and having a single
thread commit for multiple transactions, rather than jumping from one thread
to another for each commit.

This is a valid technique, but it seems to fit badly with what MySQL and
MariaDB is doing. In MariaDB, commits are _already_ grouped and done from a
single thread. In MySQL, _both_ prepare and commits are so grouped from a
single thread (though I think one thread can do group prepare in parallel
with another doing group commit).

So there seems to be an opportunity to simplify WriteImpl() for the MySQL
and MariaDB binlog case. If my understanding is correct, there will not be a
lot of writer grouping.

(Of course for the non-binlog case, the situation is different).

Maybe this comment is relevant here?

    // Requesting sync with concurrent_prepare_ is expected to be very rare. We
    // hance provide a simple implementation that is not necessarily efficient.


> == Possible solutions ==
>
> I am not sure what to do.
>
> - Make the SQL layer's Group Commit implementation invoke hton->flush_logs()
>   explicitly, like MySQL does?

That is an option, though I hope you will not do it like that. What does
flush_logs() have to do with making prepare() persistent?

You could instead add a new handlerton method group_prepare() or something.
If non-NULL, the storage engine may omit persistency in prepare(), but must
then in group_prepare() ensure that all prepares are persistent that have
completed prepare_ordered(). And if the method is non-null, the SQL layer
will call group_prepare() just before binlog write (under a new mutex
LOCK_group_prepare that is chained before LOCK_log.

This way, the extra lock can be avoided for storage engines that do not need
group_prepare(). And storage engines have freedom to implement
group_prepare() in a way that suites them.

Of course, Rocksdb can just implement group_prepare() as flush_logs() to
make _all_ prepares persistent, just like in MySQL. So for rocksdb the
functionality is identical to in MySQL, while flexibility is preserved for
other storage engines.

However, it still seems to me that there is opportunity to do better here.
For example, the upper layer could present to Rocksdb the actual list of
transactions that need to be group prepared / group committed. Then Rocksdb
could do them in a single writer without having to coordinate the threads
manually in WriteImpl().

> - Modify RocksDB so that Transaction::Commit(sync=false) do not use Group
>   Write? I am not sure if this is possible: Group Write is not about only 
>   performance, it's about preventing concurrent MemTable writes.
>   AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not
>   participate in write groups and work as if there was no other activity.

What about if rocksdb got the list of transactions to commit
(to memtable, sync=false) explicitly, rather than as individual commit() or
commit_ordered() calls? Then it could commit them all in a single writer,
which should be more efficient. And similar for prepare?

In the current MySQL (facebook patch) code, isn't it the case that each
commit() has to create a new writer and write to memtable a single commit
individually? While all of these calls are in fact from a single thread from
an explicit, known list. This does not seem optimal.

> - Modify RocksDB so that Transaction::Commit(sync=false) does not wait until
>   its write group finishes WAL sync? This could be doable but is potentially
>   complex.

That should probably be done so that a write group would only write to the
memtable (and the in-memory WAL buffer?). After that, it would release all
non-syncing participants, and the remaining syncing participants could form
a new write group to do the sync independently.

Of course, if a write group rarely syncs, this is of little benefit. From my
limited understanding of the code, flush_logs() which ends up in SyncWAL()
does not use a write group.


So this already ended up as a huge email, but I thought some background on
commit_ordered() could also help here. Note that in MySQL >=5.6, their
commit() is very similar to commit_ordered().

commit_ordered exists for three main purposes, if I recall correctly:

1. To synchronise commit order in binlog and storage engine. This ensures
that if a physical backup is taken of the storage engine and used to
provision a slave, the storage engine state corresponds to a unique point
in the binlog (MySQL has this).

2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly
synchronise snapshots between multiple storage engines (MySQL does not have
this, I think).

3. To avoid having to do an extra fsync() for every commit, on top of the
one for prepare and the one for binlog write (MySQL has this).

I think those are the main reason for commit_ordered() (I might have
forgotten some).

For this problem, I suppose (3) is the main interest?

MySQL handles (3) by stopping all transactions around binlog rotate and
doing a flush_logs(). It needs this because after binlog rotation, binlog
crash recovery has only an empty binlog, so _all_ transactions must be
durably committed at this point.

MariaDB avoids this stall around binlog rotate. Instead it extends binlog
crash recovery to be able to look into multiple binlog files. So there is no
need to force commits to disk around binlog rotate.

To eventually be able to drop binlog files, there is the
binlog_checkpoint_request() mechanism. This allows the storage engine to
inform the upper layer when all the transactions in a binlog have ended up
durably committed, in the normal course of action of the storage engine.

So to just get (3), RocksDB could just implement no commit_ordered(), or
perhaps an empty commit_ordered(). And then also implement
binlog_checkpoint_request() to record the latest prepared transaction at
that point - and when the WAL is later synced, reply back to the upper layer
to allow it to release the old binlog file. This seems doable without
support for quickly committing a transaction to memory, which current
RocksDB WriteImpl() seems poor at doing simultaneously with persistent
prepare().

Though I think it would be good if the full functionality of
commit_ordered() was implemented in RocksDB. Passing down into RocksDB
explicitly the list of transactions to group-prepare or group-commit sounds
like an interesting idea that could potentially benefit performance.

Thoughts?

Hope this helps, I wanted to present some background on this feature. Please
let me know of any details you want explained or discussed, and I will try
to answer them briefly and to the point.

 - Kristian.
Follow ups

Re: A problem with implementing Group Commit with Binlog with MyRocks
From: andrei . elkin, 2017-09-11
Re: A problem with implementing Group Commit with Binlog with MyRocks
From: Kristian Nielsen, 2017-09-11
References

A problem with implementing Group Commit with Binlog with MyRocks
From: Sergey Petrunia, 2017-09-08