maria-developers team mailing list archive

Thread
Date
Re: Architecture review of MWL#116 "Efficient group commit for binary log"

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxxx>
Date: Thu, 2 Sep 2010 12:03:55 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87tyn0qlvu.fsf@knielsen-hq.org>
User-agent: Mutt/1.5.16 (2007-06-09)
Hi, Kristian!

Let's start from the WL description:

> While currently there is no server guarantee to get same commit order
> in engines an binlog (except for the InnoDB prepare_commit_mutex
> hack), there are several reasons why this could be desirable:
...
>  - Other backup methods could have similar needs, eg. XtraBackup or
>    `mysqldump --single-transaction`, to have consistent commit order
>    between binlog and storage engines without having to do FLUSH
>    TABLES WITH READ LOCK or similar expensive blocking operation.
>    (other backup methods, like LVM snapshot, don't need consistent
>    commit order, as they can restore out-of-order commits during crash
>    recovery using XA).

I doubt mysqldump --single-transaction cares about physical commit
ordering in logs.

>  - If we have consistent commit order, we can think about optimising
>    commit to need only one fsync (for binlog); lost commits in storage
>    engines can then be recovered from the binlog at crash recovery by
>    re-playing against the engine from a particular point in the
>    binlog.

Why do you need consistent commit order for that?

>  - With consistent commit order, we can get better semantics for START
>    TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions
>    (and we could even get it to return also a matching binlog
>    position).  Currently, this "CONSISTENT SNAPSHOT" can be
>    inconsistent among multiple storage engines.

Okay, although it's a corner case.

It would be more important to have consistent reads working across
multiple storage engines without explicit START TRANSACTION WITH
CONSISTENT SNAPSHOT. But it's doable too.
 
>  - In InnoDB, the performance in the presense of hotspots can be
>    improved if we can release row locks early in the commit phase, but
>    this requires that we release them in the same order as commits in
>    the binlog to ensure consistency between master and slaves.

Conceptually InnoDB can release the locks only after the commit was made
persistent.

So, I think it's the same as "optimising commit to need only one
fsync" - if we only sync the binlog, InnoDB can release its locks right
after binlog sync.  And the question is the same, why do you need
consistent commit order for that? (I suppose, the answer will be the
same too :)

>  - There was some discussions around Galera [1] synchroneous
>    replication and global transaction ID that it needed consistent
>    commit order among participating engines.
>  - I believe there could be other applications for guaranteed
>    consistent commit order, and that the architecture described in
>    this worklog can implement such guarantee with reasonable overhead.

Okay, this much I agree with - there could be applications (galera, may
be) that need this consistent commit order. But because it's not needed
in many (if not most) cases, it'd be good to be able to switch it off,
if it'll help to gain performance.

================  High-Level Specification ===================
...
> This group_log_xid() method also is more efficient, as it avoids some
> inter-thread synchronisation. Since group_log_xid() is serialised, we
> can run it together with all the commit_ordered() method calls and
> need only a single sequential code section. With the log_xid()
> methods, we would need first a sequential part for the
> prepare_ordered() calls, then a parallel part with log_xid() calls (to
> not loose group commit ability for log_xid()), then again a sequential
> part for the commit_ordered() method calls.

How could that work ? If you run log_xid() is parallel, then you don't
know in what order xid's are logged - which means, you don't know the
commit order in the binlog. And if you don't know it, you cannot have
the same order in the engines.
 
> The extra synchronisation is needed, as each commit_ordered() call
> will have to wait for log_xid() in one thread (if log_xid() fails then
> commit_ordered() should not be called), and also wait for
> commit_ordered() to finish in all threads handling earlier commits. In
> effect we will need to bounce the execution from one thread to the
> other among all participants in the group commit.

I failed to understand the last paragraph :(
 
> As a consequence of the group_log_xid() optimisation, handlers must be
> aware that the commit_ordered() call can happen in another thread than
> the one running commit() (so thread local storage is not available).
> This should not be a big issue as the THD is available for storing any
> needed information.

I do not see how this is a consequence of the optimization.
 
> Since group_log_xid() runs for multiple transactions in a single
> thread, it can not do error reporting (my_error()) as that relies on
> thread local storage. Instead it sets an error code in THD::xid_error,
> and if there is an error then later another method will be called (in
> correct thread context) to actually report the error:
> 
>     int xid_delayed_error(THD *thd)
> 
> The three new methods prepare_ordered(), group_log_xid(), and
> commit_ordered() are optional (as is xid_delayed_error). A storage
> engine or transaction coordinator is free to not implement them if
> they are not needed. In this case there will be no order guarantee for
> the corresponding stage of group commit for that engine. For example,
> InnoDB needs no ordering of the prepare phase, so can omit
> implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at all,
> so does not need to implement any of them.

Hm. handlerton::prepare_ordered() and handlerton::commit_ordered() are
truly optional, but TC_LOG::group_log_xid() is a replacement for
TC_LOG::log_xid().

I wonder if it would be cleaner to remove TC_LOG::log_xid() (from
TC_LOG_MMAP too) completely and not keep the redundant functionality.
 
> -----------------------------------------------------------------------
> Some possible alternative for this worklog:
> 
>  - Alternatively, we could eliminate log_xid() and require that all
>  transaction coordinators implement group_log_xid() instead, again for
>  some moderate simplification.

indeed, that's what I mean
 
>  - At the moment there is no plugin actually using prepare_ordered(),
>  so, it could be removed from the design. But it fits in well, is
>  efficient to implement, and could be useful later (eg. for the
>  requested feature of releasing locks early in InnoDB).

every time when a new api is implemented there is no plugin actually
using it :)

> -----------------------------------------------------------------------
> Some possible follow-up projects after this is implemented:
> 
>  - Add statistics about how efficient group commit is
>  (#fsyncs/#commits in each engine and binlog).

this is so simple that can be done im this WL.
 
>  - Implement an XtraDB prepare_ordered() methods that can release row
>  locks early (Mark Callaghan from Facebook advocates this, but need to
>  determine exactly how to do this safely).
> 
>  - Implement a new crash recovery algorithm that uses the consistent
>  commit ordering to need only fsync() for the binlog. At crash
>  recovery, any missing transactions in an engine is replayed from the
>  correct point in the binlog (this point must be stored
>  transactionally inside the engine, as XtraDB already does today).

I'd say it's important optimization - can you turn it into a WL ?
It deserves more than a passing mentioning at the end of another WL.

Perhaps other follow-up projects may be also turned into WLs.
 
>  - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really
>  gets a consistent snapshow, with same set of committed and not
>  committed transactions in all engines, 2) returns a corresponding
>  consistent binlog position. This should be easy by piggybacking on
>  the synchronisation implemented for ha_commit_trans().
> 
>  - Use this in XtraBackup to get consistent binlog position without
>  having to block all updates with FLUSH TABLES WITH READ LOCK.
> 
================   Low-Level Design ===================
> 
> 1. Changes for ha_commit_trans()
> 
> The gut of the code for commit is in the function ha_commit_trans()
> (and in commit_one_phase() which is called from it). This must be
> extended to use the new prepare_ordered(), group_log_xid(), and
> commit_ordered() calls.
> 
> 1.1 Atomic queue of committing transactions
> 
> To keep the right commit order among participants, we put transactions
> into a queue. The operations on the queue are non-locking:
> 
>  - Insert THD at the head of the queue, and return old queue.
> 
>     THD *enqueue_atomic(THD *thd)
> 
>  - Fetch (and delete) the whole queue.
> 
>     THD *atomic_grab_reverse_queue()
> 
> These are simple to implement with atomic compare-and-set. Note that
> there is no ABA problem [2], as we do not delete individual elements
> from the queue, we grab the whole queue and replace it with NULL.
> 
> A transaction enters the queue when it does prepare_ordered(). This
> way, the scheduling order for prepare_ordered() calls is what
> determines the sequence in the queue and effectively the commit order.

How can that work ? If you insert into a queue without a lock, you
cannot know what order of commits you eend up with. So, you cannot have
the same order as you've got fof prepare_ordered. Besides, if
prepare_ordered is, precisely, *ordered*, it must be done under a lock.
I see two possibilities of doing prepare_ordered and enqueue:

1.

  pthread_mutex_lock(queue_lock);
  prepare_ordered(thd);
  enqueue(thd);
  pthread_mutex_unlock(queue_lock);

in this case prepare_ordered calls are strictly serialized by a mutex.
There is no need for non-locking enqueue_atomic, because the same mutex
must be used to define the queue ordering.

2.

  enqueue_atomic();

  ... some time later ...
  for (thd=queue_head; thd; thd= thd->next_in_queue)
    prepare_ordered(thd);

in this case insert into a queue is non-locking, the order of elements
in a queue is undefined. Later on, you iterate the queue and call
prepare_ordered in the queue order. In this implementation
all prepare_ordered and commit_ordered will likely be called from one
thread, not from the connection thread of this transaction.

==========
Note the detail: for START TRANSACTION WITH CONSISTENT SNAPSHOT and
inter-engine consistency to work, all connection threads must use the
same commit order across all engines (similar to deadlock avoidance that
MySQL uses - where all threads lock tables in the same order). A simple
way to do it would be to change trans_reqister_ha and instead of adding
participating engines to a list, mark them in an array:

  bool engines[MAX_HA];
  trans_reqister_ha() {
     ...
     engines[hton->slot]=1;
     ...
  }

and in ha_prepare_ordered() you call hton->prepare_ordered()
for all engines that have non-zero value in engines[] array.

To make the above pseudo-code simpler, I've used an array of bool,
but it would be better to have an array of void* pointers and let
the engine to store any transaction specific data there, just like it
can use thd->ha_data[] and savepoint memory.

> The queue is grabbed by the code doing group_log_xid() and
> commit_ordered() calls. The queue is passed directly to
> group_log_xid(), and afterwards iterated to do individual
> commit_ordered() calls.
> 
> Using a lock-free queue allows prepare_ordered() (for one transaction)
> to run in parallel with commit_ordered (in another transaction),
> increasing potential parallelism.

Nope, in the approach 1 (with a mutex) as above, you can have this too.
Simply as

  1.
    ... on commit ...
    pthread_mutex_lock(queue_lock);
    commit_queue= queue;
    queue=NULL;
    pthread_mutex_unlock(queue_lock);
    ... now you can iterate the commit_queue in parallel ...
    ... with preparing other transactions ...

actually you'll have to do it not on commit but before group_log_xid().
 
> The queue is simply a linked list of THD objects, linked through a
> THD::next_commit_ordered field. Since we add at the head of the queue,
> the list is actually in reverse order, so must be reversed when we
> grab and delete it.

Better to have a list of transaction objects, if possible - as
eventually we'll need to decouple st_transaction and THD to be able to
put transactions on hold and continue them in another thread, as
needed for external XA.

We aren't doing it now, but if possible, it'd better to avoid creating
new dependencies between THD and st_transaction.
 
> The reason that enqueue_atomic() returns the old queue is so that we
> can check if an insert goes to the head of the queue. The thread at
> the head of the queue will do the sequential part of group commit for
> everyone.
>
> 1.2 Locks
> 
> 1.2.1 Global LOCK_prepare_ordered
> 
> This lock is taken to serialise calls to prepare_ordered(). Note that
> effectively, the commit order is decided by the order in which threads
> obtain this lock.

okay, in this case it's approach 1, enqueue must happen under the same
lock and there's no need for a lock-free implementation
 
> 1.2.2 Global LOCK_group_commit and COND_group_commit
> 
> This lock is used to protect the serial part of group commit. It is
> taken around the code where we grab the queue, call group_log_xid() on
> the queue, and call commit_ordered() on each element of the queue, to
> make sure they happen serialised and in consistent order. It also
> protects the variable group_commit_queue_busy, which is used when not
> using group_log_xid() to delay running over a new queue until the
> first queue is completely done.
> 
> 1.2.3 Global LOCK_commit_ordered
> 
> This lock is taken around calls to commit_ordered(), to ensure they
> happen serialised.

if you have only one thread iterating the queue and calling
commit_ordered(), they'll be always serialized. why do you need a
mutex ?
 
> 1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
> 
> This lock protects the thd->group_commit_ready variable, as well as
> the condition variable used to wake up threads after log_xid() and
> commit_ordered() finishes.
> 
> 1.2.5 Global LOCK_group_commit_queue
> 
> This is only used on platforms with no native compare-and-set
> operations, to make the queue operations atomic.
>
> 1.3 Commit algorithm.
> 
> This is the basic algorithm, simplified by
> 
>  - omitting some error handling
> 
>  - omitting looping over all handlers when invoking handler methods
> 
>  - omitting some possible optimisations when not all calls needed (see
>  next section).
> 
>  - Omitting the case where no group_log_xid() is used, see below.
> 
> ---- BEGIN ALGORITHM ----
>     ht->prepare()
> 
>     // Call prepare_ordered() and enqueue in correct commit order
>     lock(LOCK_prepare_ordered)
>     ht->prepare_ordered()
>     old_queue= enqueue_atomic(thd)

so, indeed, you enqueue_atomic under a mutex. Why do you want lock-free
queue implementation ?

>     thd->group_commit_ready= FALSE
>     is_group_commit_leader= (old_queue == NULL)
>     unlock(LOCK_prepare_ordered)
> 
>     if (is_group_commit_leader)
> 
>         // The first in queue handles group commit for everyone
> 
>         lock(LOCK_group_commit)
>         // Wait while queue is busy, see below for when this occurs
>         while (group_commit_queue_busy)
>             cond_wait(COND_group_commit)
> 
>         // Grab and reverse the queue to get correct order of transactions
>         queue= atomic_grab_reverse_queue()
> 
>         // This call will set individual error codes in thd->xid_error
>         // It also sets the cookie for unlog() in thd->xid_cookie
>         group_log_xid(queue)
> 
>         lock(LOCK_commit_ordered)
>         for (other IN queue)
>             if (!other->xid_error)
>                 ht->commit_ordered()
>         unlock(LOCK_commit_ordered)

commit_ordered() is called from one thread, and it's protected
by LOCK_group_commit. There's no need for LOCK_commit_ordered.
 
>         unlock(LOCK_group_commit)
> 
>         // Now we are done, so wake up all the others.
>         for (other IN TAIL(queue))
>             lock(other->LOCK_commit_ordered)
>             other->group_commit_ready= TRUE
>             cond_signal(other->COND_commit_ordered)
>             unlock(other->LOCK_commit_ordered)

why do you want one condition per thread ?
You can use one condition per queue, such as:

     else
         // If not the leader, just wait until leader did the work for us.
         lock(old_queue->LOCK_commit_ordered)
         while (!old_queue->group_commit_ready)
             cond_wait(old_queue->LOCK_commit_ordered, old_queue->COND_commit_ordered)
         unlock(old_queue->LOCK_commit_ordered)

this way you'll need just one broadcast to wake up everybody.

>     else
>         // If not the leader, just wait until leader did the work for us.
>         lock(thd->LOCK_commit_ordered)
>         while (!thd->group_commit_ready)
>             cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
>         unlock(other->LOCK_commit_ordered)

This doesn't work, I'm afraid. The leader can complete its job and
signal all threads in a queue before some of the queue threads has
locked their thd->LOCK_commit_ordered mutex. These threads will miss the
wakeup signal.

To close all loopholes, I think, you can use my suggestion above - with
one broadcast and waiting on old_queue->COND_commit_ordered - and every
thread, including the leader, must lock the leader->LOCK_commit_ordered
before unlocking LOCK_prepare_ordered. That is, it'll be

     // Call prepare_ordered() and enqueue in correct commit order
     lock(LOCK_prepare_ordered)
     ht->prepare_ordered()
     old_queue= queue_head
     enqueue(thd)
     leader= queue_head
     lock(leader->LOCK_commit_ordered)
     unlock(LOCK_prepare_ordered)

     if (leader != thd)
       while (!leader->group_commit_done)
         cond_wait(leader->LOCK_commit_ordered, leader->COND_commit_ordered)
     else
       leader->group_commit_done= false;
       .... the rest of your group commit code
       broadcast(leader->COND_commit_ordered);
       unlock(leader->LOCK_commit_ordered);

>     // Finally do any error reporting now that we're back in own thread.
>     if (thd->xid_error)
>         xid_delayed_error(thd)
>     else
>         ht->commit(thd)
>         unlog(thd->xid_cookie, thd->xid)
> ---- END ALGORITHM ----
... 
I've skipped the algorithm that uses old log_xid()
...
> 2. Binlog code changes (log.cc)
>
> The bulk of the work needed for the binary log is to extend the code
> to allow group commit to the log. Unlike InnoDB/XtraDB, there is no
> existing support inside the binlog code for group commit.
> 
> The existing code runs most of the write + fsync to the binary lock
> under the global LOCK_log mutex, preventing any group commit.
> 
> To enable group commit, this code must be split into two parts:
> 
>  - one part that runs per transaction, re-writing the embedded event
>  positions for the correct offset, and writing this into the in-memory
>  log cache.
> 
>  - another part that writes a set of transactions to the disk, and
>  runs fsync().
> 
> Then in group_log_xid(), we can run the first part in a loop over all
> the transactions in the passed-in queue, and run the second part only
> once.
> 
> The binlog code also has other code paths that write into the binlog,
> eg. non-transactional statements. These have to be adapted also to
> work with the new code.
> 
> In order to get some group commit facility for these also, we change
> that part of the code in a similar way to ha_commit_trans. We keep
> another, binlog-internal queue of such non-transactional binlog
> writes, and such writes queue up here before sleeping on the LOCK_log
> mutex. Once a thread obtains the LOCK_log, it loops over the queue for
> the fast part, and does the slow part once, then finally wakes up the
> others in the queue.

I wouldn't bother implementing any grouping for non-transactional
engines. They aren't crash safe, so practically users of these engines
don't need or use --sync-binlog anyway.
 
> I think the group commit is mostly useful in transactional workloads
> anyway (non-transactional engines will loose data anyway in case of
> crash, so why fsync() after each transaction?)

Yes, exactly

> 3. XtraDB changes (ha_innodb.cc)
> 
> The changes needed in XtraDB are comparatively simple, as XtraDB
> already implements group commit, it just needs to be enabled with the
> new commit_ordered() call.
> 
> The existing commit() method already is logically in two parts. The
> first part runs under the prepare_commit_mutex() and must be run in
> same order as binlog commit. This part needs to be moved to
> commit_ordered(). The second part runs after releasing
> prepare_commit_mutex and does transaction log write+fsync; it can
> remain.
> 
> Then the prepare_commit_mutex is removed (and the
> enable_unsafe_group_commit XtraDB option to disable it).
> 
> There are two asserts that check that the thread running the first
> part of XtraDB commit is the same as the thread running the other
> operations for the transaction. These have to be removed (as
> commit_ordered() can run in a different thread). Also an error
> reporting with sql_print_error() has to be delayed until commit()
> time.
>
> 4. Proof-of-concept implementation
> 
> There is a proof-of-concept implementation of this architecture, in
> the form of a quilt patch series [3].
> 
> A quick benchmark was done, with sync_binlog=1 and
> innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
> transactions against one table.
> 
> Without the patch, we get only 25 queries per second.
> 
> With the patch, we get 650 queries per second.

Please do RQG transactional tests (Philip can help with that)
to verify that ACID wasn't compromized

> 6. Alternative implementations
> 
>  - The binlog code maintains its own extra atomic transaction queue to
>  handle non-transactional commits in a good way together with
>  transactional (with respect to group commit). Alternatively, we could
>  ignore this issue and just give up on group commit for
>  non-transactional statements, for some code simplifications.

Right, that's what I suggested above.
Seems like whatever I write, you have the same later in the WL :)
 
>  - It would probably be a good idea to implement
>  TC_LOG_MMAP::group_log_xid() (should not be hard).

Agree (I wrote that above, too :)

Regards,
Sergei
Follow ups

Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Kristian Nielsen, 2010-09-03