maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #06290
Review of MDEV-4506, parallel replication, part 3
Hi!
This is a review of the latest code pushed to your tree:
bzr diff -r 3662..
Note that I will apply some of the trivial changes myself to the
tree to speed up things (this is mostly about comments).
=== added file 'mysql-test/suite/rpl/r/rpl_parallel.result'
--- mysql-test/suite/rpl/r/rpl_parallel.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/rpl/r/rpl_parallel.result 2013-09-22 20:25:20 +0000
Great that you got these done!
Will make my life much easier when I work the parallel replication
code!
<cut>
=== modified file 'sql/log.cc'
--- sql/log.cc 2013-07-04 22:26:15 +0000
+++ sql/log.cc 2013-09-22 20:25:20 +0000
@@ -6542,27 +6542,93 @@ MYSQL_BIN_LOG::write_transaction_to_binl
}
}
+
+/*
+ Put a transaction that is ready to commit in the group commit queue.
+ The transaction is identified by the ENTRY object passed into this function.
+
+ To facilitate group commit for the binlog, we first queue up ourselves in
+ this function. Then later the first thread to enter the queue waits for
+ the LOCK_log mutex, and commits for everyone in the queue once it gets the
+ lock. Any other threads in the queue just wait for the first one to finish
+ the commit and wake them up. This way, all transactions in the queue get
+ committed in a single disk operation.
+
+ The return value of this function is TRUE if queued as the first entry in
+ the queue (meaning this is the leader), FALSE otherwise.
I moved this last in the comment as:
@retval TRUE If queued as the first entry in the queue (meaning this
is the leader)
@retval FALSE Otherwise
+
+ The main work in this function is when the commit in one transaction has
+ been marked to wait for the commit of another transaction to happen
+ first. This is used to support in-order parallel replication, where
+ transactions can execute out-of-order but need to be committed in-order with
+ how they happened on the master. The waiting of one commit on another needs
+ to be integrated with the group commit queue, to ensure that the waiting
+ transaction can participate in the same group commit as the waited-for
+ transaction.
+
+ So when we put a transaction in the queue, we check if there were other
+ transactions already prepared to commit but just waiting for the first one
+ to commit. If so, we add those to the queue as well, transitively for all
+ waiters.
+*/
+
<cut>
@@ -6574,10 +6640,28 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr
This would be natural to do with recursion, but we want to avoid
potentially unbounded recursion blowing the C stack, so we use the list
approach instead.
+
+ We keep a list of all the waiters that need to be processed in `list',
+ linked through the next_subsequent_commit pointer. Initially this list
+ contains only the entry passed into this function.
+
+ We process entries in the list one by one. The element currently being
+ processed is pointed to by `cur`, and the element at the end of the list
+ is pointed to by `last` (we do not use NULL to terminate the list).
+
+ As we process an element, it is first added to the group_commit_queue.
+ Then any waiters for that element are added at the end of the list, to
+ be processed in subsequent iterations. This continues until the list
+ is exhausted, with all elements ever added eventually processed.
+
+ The end result is a breath-first traversal of the tree of waiters,
+ re-using the next_subsequent_commit pointers in place of extra stack
+ space in a recursive traversal.
I added:
The temporary list created in next_subsequent_commit is not
used by the caller or any other function.
@@ -6670,6 +6763,18 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr
{
if (list->wakeup_subsequent_commits_running)
{
+ /*
+ ToDo: We should not need a full lock/unlock of LOCK_wait_commit
+ here. All we need is a single (full) memory barrier, to ensure that
+ the reads of the list above are not reordered with the write of
+ wakeup_subsequent_commits_running, or with the writes to the list
+ from other threads that is allowed to happen after
+ wakeup_subsequent_commits_running has been set to false.
+
+ We do not currently have explicit memory barrier primitives in the
+ source tree, but if we get them the below mysql_mutex_lock() could
+ be replaced with a full memory barrier just before the loop.
+ */
ok (for now).
Another way would of course to only do a mutex lock for the first
entry we find...
No more comments. Everything looked fine!
I will start coordinate with you tomorrow about things to do.
Regards,
Monty