maria-developers team mailing list archive

A problem with implementing Group Commit with Binlog with MyRocks

 

Hello,

This is about https://jira.mariadb.org/browse/MDEV-11934. I've encountered
an interesting issue here, so I thought I would consult both the MyRocks and
MariaDB lists.

== Some background ==

The "group commit with binlog" feature needs to accomplish two goals:

1. Keep the binlog and the storage engine in sync.
  This is done by employing XA between the binlog and the storage engine. It
  works by making these calls:
  
  /*
    Make the transaction's changes ready to be committed (no conflicts
    with other transactions, etc) but do not commit them yet.
    The effects of the prepare operation must be synced to disk, as the storage
    engine needs to be able to recover (i.e. commit) the prepared transaction
    after a crash.
  */
  storage_engine->prepare(sync=true);
  
  /* 
    After this call, the transaction is considered committed. In case of a crash,
    the recovery process will use the contents of the binlog to determine which
    of the prepared transactions are to be committed and which are to be rolled
    back.
  */
  binlog->write(sync=true);
  
  /*
    Commit the transaction in the storage engine. This makes its changes visible 
    to other transactions (and also releases its locks and so forth)
    Note that most of the time(*) we don't need to sync there. In case of a crash
    we will be able to recover using the binlog.
  */
  storage_engine->commit();

2. The second goal is to make the operation performant. The sequence above
  needs two coordinated disk flushes per transaction. The idea is to do "Group
  Commit", where multiple transactions share disk flushes.
  So, we need to do group commit and keep the storage engine and the binlog in
  sync while doing that.

== Group Commit with Binlog in MySQL ==

MySQL (and fb/mysql-5.6 in particular) does this in the following phases:

Phase #1:
  Call storage_engine->prepare() for all transactions in the group.
  The call itself is not persistent.

Phase #2: Call storage_engine->flush_logs().
  This makes the effect of all Prepare operations from Phase #1 persistent.

Phase #3:
  Write and sync the binary log.

Phase #4: 
  Call storage_engine->commit(). This does not need to be persistent.

MyRocks implements them.
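
To make the cost of the four phases concrete, here is a minimal sketch that
only counts disk syncs. MockEngine, MockBinlog and commit_group() are
hypothetical names invented for this illustration; they are not the real
handlerton or binlog interfaces, and the sketch ignores error handling and
concurrency.

```cpp
#include <cassert>

// Hypothetical stand-ins that only count disk syncs. These are NOT the real
// handlerton or binlog interfaces.
struct MockEngine {
  int syncs = 0;
  int prepared = 0;          // prepared in memory only
  int durable_prepared = 0;  // prepared and synced to disk
  int committed = 0;

  void prepare() { ++prepared; }  // Phase #1: not persistent by itself
  void flush_logs() {             // Phase #2: one sync covers the whole group
    durable_prepared = prepared;
    ++syncs;
  }
  void commit() {                 // Phase #4: no sync needed
    assert(committed < durable_prepared);  // must be durably prepared first
    ++committed;
  }
};

struct MockBinlog {
  int syncs = 0;
  int written = 0;
  void write_and_sync(int n) {    // Phase #3: one write+sync for the group
    written += n;
    ++syncs;
  }
};

// Runs one commit group of n transactions through the four phases and
// returns the number of disk syncs the group needed.
int commit_group(MockEngine& engine, MockBinlog& binlog, int n) {
  const int syncs_before = engine.syncs + binlog.syncs;
  for (int i = 0; i < n; ++i) engine.prepare();
  engine.flush_logs();
  binlog.write_and_sync(n);
  for (int i = 0; i < n; ++i) engine.commit();
  return engine.syncs + binlog.syncs - syncs_before;
}
```

In this model a group of 8 transactions costs 2 syncs in total, the same as
a group of 1, which is the whole point of sharing the flushes.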


== Group Commit with Binlog in MariaDB ==

MariaDB does not have the first two of the phases described above:

>  Phase #1:
>    Call storage_engine->prepare() for all transactions in the group.
>    The call itself is not persistent.
>
>  Phase #2: Call storage_engine->flush_logs().
>    This makes the effect of all Prepare operations from Phase #1 persistent.

A quote from Kristian's description at 
https://lists.launchpad.net/maria-developers/msg10832.html

>> So the idea is to do group prepare with the same group of transactions that
>> will later group commit to the binlog. In MariaDB, this concept does not
>> exist. Storage engine prepares are allowed to run in parallel and in any
>> order compared to binlog commit.

Initially this looked like it could work for MyRocks.

MyRocks has a group commit implementation: both Prepare() and Commit()
operations participate in groups.

However, when I implemented group commit on top of it, I found its
performance to be close to what one would expect if there were no commit
grouping and every commit() call flushed to disk.

https://jira.mariadb.org/browse/MDEV-11934 has the details.

== The issue ==

(I'm 95% certain about this. It's not 100% yet but it is very likely)

RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc,
DBImpl::WriteImpl function) handles both Prepare() and Commit() commands 
and does the following:

1. Controls writing the committed data into the MemTable
2. Writes transactions to WAL
3. Syncs the WAL.

All three steps are done for the whole group. This has a consequence: a
Commit() operation that does not need to sync the WAL will still be delayed
if another operation in the group needs the WAL to be synced.

This delay has a disastrous effect, because the SQL layer tries to keep the
same order of transactions in the storage engine and in the binlog. In order
to do that, it calls rocksdb_commit_ordered() for each transaction
sequentially. Delaying one transaction delays the entire SQL-level commit group.
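
A toy model of the effect described above, to show how the delay adds up.
This is not RocksDB code: WriteRequest, the cost constants and both functions
are hypothetical, and the microsecond figures are made-up illustrative values,
not measurements.

```cpp
#include <cassert>
#include <vector>

// One member of a write group (hypothetical type for this illustration).
struct WriteRequest {
  bool needs_sync;
};

// Made-up illustrative costs in microseconds, not measurements.
constexpr int kWalWriteUs = 10;
constexpr int kWalSyncUs = 1000;

// Latency seen by every member of a write group. The group completes as a
// unit, so a single sync request is paid for by all members.
int group_latency_us(const std::vector<WriteRequest>& group) {
  bool any_sync = false;
  for (const WriteRequest& r : group) any_sync = any_sync || r.needs_sync;
  return kWalWriteUs + (any_sync ? kWalSyncUs : 0);
}

// Total time for n_commits non-syncing commits issued one after another
// (as with sequential commit_ordered calls), where each commit lands in a
// write group together with one neighbouring operation.
int sequential_commit_us(int n_commits, bool neighbour_syncs) {
  assert(n_commits >= 0);
  int total = 0;
  for (int i = 0; i < n_commits; ++i) {
    std::vector<WriteRequest> group = {{false}, {neighbour_syncs}};
    total += group_latency_us(group);
  }
  return total;
}
```

With these assumed costs, ten sequential commits take 100us when nobody in
their groups syncs, and 10100us when each one shares a group with a syncing
operation, even though none of the ten commits asked for a sync itself.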


== Possible solutions ==

I am not sure what to do.

- Make the SQL layer's Group Commit implementation invoke hton->flush_logs()
  explicitly, like MySQL does?

- Modify RocksDB so that Transaction::Commit(sync=false) does not use Group
  Write? I am not sure if this is possible: Group Write is not only about
  performance, it's also about preventing concurrent MemTable writes.
  AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not
  participate in write groups and work as if there was no other activity.

- Modify RocksDB so that Transaction::Commit(sync=false) does not wait until
  its write group finishes WAL sync? This could be doable but is potentially
  complex.


BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog



