← Back to team overview

maria-developers team mailing list archive

Review of MDEV-4506, parallel replication, part 3

 

Hi!

This is a review of the latest code pushed to your tree:

bzr diff -r 3662..

Note that I will apply some of the trivial changes myself to the
tree to speed up things (this is mostly about comments).

=== added file 'mysql-test/suite/rpl/r/rpl_parallel.result'
--- mysql-test/suite/rpl/r/rpl_parallel.result	1970-01-01 00:00:00 +0000
+++ mysql-test/suite/rpl/r/rpl_parallel.result	2013-09-22 20:25:20 +0000

Great that you got these done!
Will make my life much easier when I work the parallel replication
code!

<cut>

=== modified file 'sql/log.cc'
--- sql/log.cc	2013-07-04 22:26:15 +0000
+++ sql/log.cc	2013-09-22 20:25:20 +0000
@@ -6542,27 +6542,93 @@ MYSQL_BIN_LOG::write_transaction_to_binl
   }
 }
 
+
+/*
+  Put a transaction that is ready to commit in the group commit queue.
+  The transaction is identified by the ENTRY object passed into this function.
+
+  To facilitate group commit for the binlog, we first queue up ourselves in
+  this function. Then later the first thread to enter the queue waits for
+  the LOCK_log mutex, and commits for everyone in the queue once it gets the
+  lock. Any other threads in the queue just wait for the first one to finish
+  the commit and wake them up. This way, all transactions in the queue get
+  committed in a single disk operation.
+
+  The return value of this function is TRUE if queued as the first entry in
+  the queue (meaning this is the leader), FALSE otherwise.

I moved this last in the comment as:

  @retval  TRUE   If queued as the first entry in the queue (meaning this
                  is the leader)
  @retval FALSE   Otherwise        

+
+  The main work in this function is when the commit in one transaction has
+  been marked to wait for the commit of another transaction to happen
+  first. This is used to support in-order parallel replication, where
+  transactions can execute out-of-order but need to be committed in-order with
+  how they happened on the master. The waiting of one commit on another needs
+  to be integrated with the group commit queue, to ensure that the waiting
+  transaction can participate in the same group commit as the waited-for
+  transaction.
+
+  So when we put a transaction in the queue, we check if there were other
+  transactions already prepared to commit but just waiting for the first one
+  to commit. If so, we add those to the queue as well, transitively for all
+  waiters.
+*/
+
<cut>

@@ -6574,10 +6640,28 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr
     This would be natural to do with recursion, but we want to avoid
     potentially unbounded recursion blowing the C stack, so we use the list
     approach instead.
+
+    We keep a list of all the waiters that need to be processed in `list',
+    linked through the next_subsequent_commit pointer. Initially this list
+    contains only the entry passed into this function.
+
+    We process entries in the list one by one. The element currently being
+    processed is pointed to by `cur`, and the element at the end of the list
+    is pointed to by `last` (we do not use NULL to terminate the list).
+
+    As we process an element, it is first added to the group_commit_queue.
+    Then any waiters for that element are added at the end of the list, to
+    be processed in subsequent iterations. This continues until the list
+    is exhausted, with all elements ever added eventually processed.
+
+    The end result is a breath-first traversal of the tree of waiters,
+    re-using the next_subsequent_commit pointers in place of extra stack
+    space in a recursive traversal.

I added:

The temporary list created in next_subsequent_commit is not
used by the caller or any other function.


@@ -6670,6 +6763,18 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr
     {
       if (list->wakeup_subsequent_commits_running)
       {
+        /*
+          ToDo: We should not need a full lock/unlock of LOCK_wait_commit
+          here. All we need is a single (full) memory barrier, to ensure that
+          the reads of the list above are not reordered with the write of
+          wakeup_subsequent_commits_running, or with the writes to the list
+          from other threads that is allowed to happen after
+          wakeup_subsequent_commits_running has been set to false.
+
+          We do not currently have explicit memory barrier primitives in the
+          source tree, but if we get them the below mysql_mutex_lock() could
+          be replaced with a full memory barrier just before the loop.
+        */

ok (for now).

Another way would of course to only do a mutex lock for the first
entry we find...

No more comments. Everything looked fine!

I will start coordinate with you tomorrow about things to do.

Regards,
Monty