maria-developers team mailing list archive

Re: Reply: in-order commit

 

Hi,

I implemented the last part of in-order commit, which pushes the wait into the
transaction coordinator, so that group commit can work and performance can be
good.

On the one hand I am really pleased to get this done; it is something I have
been thinking about for 3 years now. On the other hand I realise this is
fairly complex stuff, so please be aware that I am 100% open to suggestions
for changes to this, or to other ideas on how to proceed.

I ran some quick benchmarks. I set up a master with 20000 independent inserts
into a table. I ran with sync_binlog=1, innodb_flush_log_at_trx_commit=1, and
--log-slave-updates.

The base MariaDB needs 71 seconds to replicate the 20000 transactions.

Your original patch needs 12 seconds.

With my in-order patch 15 seconds are needed.

But with my in-order patch and the number of threads increased from 16 to 24,
just 11 seconds are needed.

  71 seconds  Base
  12 seconds  Original @ 16 threads
  15 seconds  In-order @ 16 threads
  11 seconds  In-order @ 24 threads

So for this quick benchmark, in-order is somewhat slower, but one can
compensate for this by increasing the number of threads. This makes sense;
with in-order there will be some threads waiting, so adding more threads is
needed to ensure enough non-waiting threads to get full performance.

These are great results, I think; in-order appears quite viable
performance-wise. A number of things become easier when commits on the slave
are guaranteed to be in-order (such as global transaction id).

(And btw, that's 6 times faster replication without the user having to do
anything special, which is also *very* nice! I am really looking forward to
getting this fully integrated in MariaDB).
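
As a quick check of the "6 times" figure against the timings listed above, here
is a tiny helper that computes the speedup of a run relative to base. The
function name speedup() is made up for this illustration; only the timings
come from the benchmark.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a run relative to the base run (both given in seconds).
// speedup(71, 12) is about 5.9, speedup(71, 15) about 4.7 and
// speedup(71, 11) about 6.5, so "6 times" is a fair round figure.
double speedup(double base_seconds, double run_seconds)
{
  assert(run_seconds > 0.0);
  return base_seconds / run_seconds;
}
```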

On the other hand, it is clear that some workloads will suffer under
in-order. For example something like this:

    UPDATE t1 SET a=5 WHERE id=10;
    UPDATE t1 SET a=4 WHERE id=10;
    UPDATE t1 SET a=3 WHERE id=10;
    UPDATE t1 SET a=2 WHERE id=10;
    UPDATE t1 SET a=1 WHERE id=10;
    UPDATE t1 SET a=1 WHERE id=20;
    UPDATE t1 SET a=2 WHERE id=20;
    UPDATE t1 SET a=3 WHERE id=20;
    UPDATE t1 SET a=4 WHERE id=20;
    UPDATE t1 SET a=5 WHERE id=20;

With out-of-order, all the id=20 updates can run in parallel with all the
id=10 updates. But with in-order, the first id=20 update will only commit
after all the id=10 updates have run, so the remaining id=20 updates can not
run in parallel. Performance will be slower, unless there are more events
deeper down in the binlog which can be run in parallel instead.
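
To make the in-order behaviour concrete, here is a minimal sketch in plain
C++11 threads. It is not code from the patch: the class CommitOrderer and the
function run_in_order() are hypothetical, and the single shared counter is a
deliberate simplification of the per-transaction wait lists in the real
wait_for_commit. It only shows the principle that workers apply in parallel
but each commit waits for the previous one.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not MariaDB code): lets transactions commit only in
// their sequence order, while the work before the commit runs in parallel.
class CommitOrderer
{
public:
  // Block until every transaction with a smaller sequence number committed.
  void wait_for_prior_commit(std::size_t seq)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [&] { return next_to_commit_ == seq; });
  }

  // Mark transaction `seq` as committed and wake up the waiting workers.
  void wakeup_subsequent_commits(std::size_t seq)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      assert(next_to_commit_ == seq);
      next_to_commit_= seq + 1;
    }
    cond_.notify_all();
  }

private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::size_t next_to_commit_= 0;
};

// Run `n` transactions on `n` worker threads and return the commit order.
std::vector<std::size_t> run_in_order(std::size_t n)
{
  CommitOrderer orderer;
  std::vector<std::size_t> commit_log;
  std::vector<std::thread> workers;

  for (std::size_t seq= 0; seq < n; ++seq)
  {
    workers.emplace_back([&orderer, &commit_log, seq] {
      // The "apply" phase would run here, in parallel with other workers.
      orderer.wait_for_prior_commit(seq);
      commit_log.push_back(seq);  // Safe: only one worker is here at a time.
      orderer.wakeup_subsequent_commits(seq);
    });
  }
  for (std::thread &t : workers)
    t.join();
  return commit_log;
}
```

Calling run_in_order(10) models the ten UPDATEs above: whatever order the
threads are scheduled in, the first id=20 update (sequence number 5) cannot
commit before sequence numbers 0 to 4 are done.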

It is hard to predict how common such cases will be, but I am at least
hopeful!

To fix a potential deadlock with MyISAM, I changed conflict detection. Now for
non-transactional tables, the hash key will have only the table name (not the
PK values). Thus, any two updates of the same MyISAM table will be treated as
a conflict. Updates to different tables are ok to run in parallel.
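
The rule can be sketched roughly as follows. This is a simplified standalone
model, not the code in Rows_log_event::get_pk_value(): the struct RowChange
and the functions conflict_key() and conflicts() are hypothetical names, and
std::string stands in for the fixed-size key buffer. The "\1db\1table" suffix
follows the format used in the patch.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for what the slave knows about one row change.
struct RowChange
{
  std::string db;
  std::string table;
  std::string pk_value;   // Packed primary key bytes.
  bool transactional;     // False for engines such as MyISAM.
};

// Build the conflict-detection key. For a transactional engine the key is
// the PK value followed by "\1db\1table"; for a non-transactional engine
// the PK value is left out, so the key identifies the whole table.
std::string conflict_key(const RowChange &row)
{
  std::string key;
  if (row.transactional)
    key= row.pk_value;
  key+= '\1';
  key+= row.db;
  key+= '\1';
  key+= row.table;
  return key;
}

// Two changes conflict (must not run in parallel) if their keys are equal.
bool conflicts(const RowChange &a, const RowChange &b)
{
  return conflict_key(a) == conflict_key(b);
}
```

With this model, two MyISAM updates to test.t1 conflict even for different PK
values, while two InnoDB updates to test.t1 conflict only when the PK values
match.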

I pushed the changes as usual to

    lp:~knielsen/maria/dingqi-parallel-replication/

and I attached the full patch.

This completes the in-order experiment for me. Of course the patch still needs
more work to be finished; for example, it should probably be possible to
enable/disable in-order with an option.

But probably first we should discuss more if in-order is a good idea at all,
and in general how to proceed with the integration of the parallel replication
feature.

A big benefit of the in-order method is that users will be able to enable it
without fear that their applications will break. Unlike for example the MySQL
5.6 multi-threaded slave, there is no need to partition the data into
different schemas and audit/rewrite all applications to ensure no cross-schema
queries. With in-order, things will work exactly as normal; it is invisible to
applications. The only restriction is that row-based replication is required
to get a speedup, but everything works correctly even if some statement-based
events turn up (and if we combine it with
http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184 then we can even do
some statement-based parallel replication, also using the in-order stuff). And
we can still have out-of-order as an option.

 - Kristian.

=== modified file 'include/mysql/plugin.h'
--- include/mysql/plugin.h	2012-03-28 17:26:00 +0000
+++ include/mysql/plugin.h	2013-01-15 13:02:54 +0000
@@ -700,6 +700,41 @@ void *thd_get_ha_data(const MYSQL_THD th
 */
 void thd_set_ha_data(MYSQL_THD thd, const struct handlerton *hton,
                      const void *ha_data);
+
+
+/**
+  Signal that the first part of handler commit is finished, and that the
+  committed transaction is now visible and has fixed commit ordering with
+  respect to other transactions. The commit need _not_ be durable yet, and
+  typically will not be when this call makes sense.
+
+  This call is optional, if the storage engine does not call it the upper
+  layer will after the handler commit() method is done. However, the storage
+  engine may choose to call it itself to increase the possibility for group
+  commit.
+
+  In-order parallel replication uses this to apply different transaction in
+  parallel, but delay the commits of later transactions until earlier
+  transactions have committed first, thus achieving increased performance on
+  multi-core systems while still preserving full transaction consistency.
+
+  The storage engine can call this from within the commit() method, typically
+  after the commit record has been written to the transaction log, but before
+  the log has been fsync()'ed. This will allow the next replicated transaction
+  to proceed to commit before the first one has done fsync() or similar. Thus,
+  it becomes possible for multiple sequential replicated transactions to share
+  a single fsync() inside the engine in group commit.
+
+  Note that this method should _not_ be called from within the commit_ordered()
+  method, or any other place in the storage engine. When commit_ordered() is
+  used (typically when binlog is enabled), the transaction coordinator takes
+  care of this and makes group commit in the storage engine possible without
+  any other action needed on the part of the storage engine. This function
+  thd_wakeup_subsequent_commits() is only needed when no transaction
+  coordinator is used, meaning a single storage engine and no binary log.
+*/
+void thd_wakeup_subsequent_commits(MYSQL_THD thd);
+
 #ifdef __cplusplus
 }
 #endif

=== modified file 'include/mysql/plugin_audit.h.pp'
--- include/mysql/plugin_audit.h.pp	2012-03-28 17:26:00 +0000
+++ include/mysql/plugin_audit.h.pp	2013-01-15 13:02:54 +0000
@@ -235,6 +235,7 @@ void mysql_query_cache_invalidate4(void*
 void *thd_get_ha_data(const void* thd, const struct handlerton *hton);
 void thd_set_ha_data(void* thd, const struct handlerton *hton,
                      const void *ha_data);
+void thd_wakeup_subsequent_commits(void* thd);
 struct mysql_event_general
 {
   unsigned int event_subclass;

=== modified file 'include/mysql/plugin_auth.h.pp'
--- include/mysql/plugin_auth.h.pp	2012-03-28 17:26:00 +0000
+++ include/mysql/plugin_auth.h.pp	2013-01-15 13:02:54 +0000
@@ -235,6 +235,7 @@ void mysql_query_cache_invalidate4(void*
 void *thd_get_ha_data(const void* thd, const struct handlerton *hton);
 void thd_set_ha_data(void* thd, const struct handlerton *hton,
                      const void *ha_data);
+void thd_wakeup_subsequent_commits(void* thd);
 #include <mysql/plugin_auth_common.h>
 typedef struct st_plugin_vio_info
 {

=== modified file 'include/mysql/plugin_ftparser.h.pp'
--- include/mysql/plugin_ftparser.h.pp	2012-03-28 17:26:00 +0000
+++ include/mysql/plugin_ftparser.h.pp	2013-01-15 13:02:54 +0000
@@ -188,6 +188,7 @@ void mysql_query_cache_invalidate4(void*
 void *thd_get_ha_data(const void* thd, const struct handlerton *hton);
 void thd_set_ha_data(void* thd, const struct handlerton *hton,
                      const void *ha_data);
+void thd_wakeup_subsequent_commits(void* thd);
 enum enum_ftparser_mode
 {
   MYSQL_FTPARSER_SIMPLE_MODE= 0,

=== modified file 'sql/handler.cc'
--- sql/handler.cc	2012-09-22 14:11:40 +0000
+++ sql/handler.cc	2013-01-15 13:02:54 +0000
@@ -1389,6 +1389,8 @@ int ha_commit_one_phase(THD *thd, bool a
   */
   bool is_real_trans=all || thd->transaction.all.ha_list == 0;
   DBUG_ENTER("ha_commit_one_phase");
+  if (is_real_trans)
+    thd->wait_for_prior_commit();
   int res= commit_one_phase_2(thd, all, trans, is_real_trans);
   DBUG_RETURN(res);
 }
@@ -1428,7 +1430,10 @@ commit_one_phase_2(THD *thd, bool all, T
   }
   /* Free resources and perform other cleanup even for 'empty' transactions. */
   if (is_real_trans)
+  {
+    thd->wakeup_subsequent_commits();
     thd->transaction.cleanup();
+  }
 
   DBUG_RETURN(error);
 }
@@ -1503,7 +1508,10 @@ int ha_rollback_trans(THD *thd, bool all
   }
   /* Always cleanup. Even if nht==0. There may be savepoints. */
   if (is_real_trans)
+  {
+    thd->wakeup_subsequent_commits();
     thd->transaction.cleanup();
+  }
   if (all)
     thd->transaction_rollback_request= FALSE;
 

=== modified file 'sql/log.cc'
--- sql/log.cc	2012-12-05 14:05:37 +0000
+++ sql/log.cc	2013-01-15 13:02:54 +0000
@@ -6216,44 +6216,199 @@ MYSQL_BIN_LOG::write_transaction_to_binl
 }
 
 bool
-MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
+MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry,
+                                      wait_for_commit *wfc)
 {
+  group_commit_entry *orig_queue;
+  wait_for_commit *list, *cur, *last;
+
   /*
     To facilitate group commit for the binlog, we first queue up ourselves in
     the group commit queue. Then the first thread to enter the queue waits for
     the LOCK_log mutex, and commits for everyone in the queue once it gets the
     lock. Any other threads in the queue just wait for the first one to finish
     the commit and wake them up.
+
+    To support in-order parallel replication with group commit, after we add
+    some transaction to the queue, we check if there were other transactions
+    already prepared to commit but just waiting for the first one to commit.
+    If so, we add those to the queue as well, transitively for all waiters.
   */
 
   entry->thd->clear_wakeup_ready();
   mysql_mutex_lock(&LOCK_prepare_ordered);
-  group_commit_entry *orig_queue= group_commit_queue;
-  entry->next= orig_queue;
-  group_commit_queue= entry;
-
-  if (entry->cache_mngr->using_xa)
-  {
-    DEBUG_SYNC(entry->thd, "commit_before_prepare_ordered");
-    run_prepare_ordered(entry->thd, entry->all);
-    DEBUG_SYNC(entry->thd, "commit_after_prepare_ordered");
+  orig_queue= group_commit_queue;
+
+  /*
+    Iteratively process everything added to the queue, looking for waiters,
+    and their waiters, and so on. If a waiter is ready to commit, we
+    immediately add it to the queue; if not we just wake it up.
+
+    This would be natural to do with recursion, but we want to avoid
+    potentially unbounded recursion blowing the C stack, so we use the list
+    approach instead.
+  */
+  list= wfc;
+  cur= list;
+  last= list;
+  for (;;)
+  {
+    /* Add the entry to the group commit queue. */
+    entry->next= group_commit_queue;
+    group_commit_queue= entry;
+
+    if (entry->cache_mngr->using_xa)
+    {
+      DEBUG_SYNC(entry->thd, "commit_before_prepare_ordered");
+      run_prepare_ordered(entry->thd, entry->all);
+      DEBUG_SYNC(entry->thd, "commit_after_prepare_ordered");
+    }
+
+    if (!cur)
+      break;             // Can happen if initial entry has no wait_for_commit
+
+    if (cur->subsequent_commits_list)
+    {
+      bool have_lock;
+      wait_for_commit *waiter;
+
+      mysql_mutex_lock(&cur->LOCK_wait_commit);
+      have_lock= true;
+      waiter= cur->subsequent_commits_list;
+      /* Check again, now safely under lock. */
+      if (waiter)
+      {
+        /* Grab the list of waiters and process it. */
+        cur->subsequent_commits_list= NULL;
+        do
+        {
+          wait_for_commit *next= waiter->next_subsequent_commit;
+          group_commit_entry *entry2=
+            (group_commit_entry *)waiter->opaque_pointer;
+          if (entry2)
+          {
+            /*
+              This is another transaction ready to be written to the binary
+              log. We can put it into the queue directly, without needing a
+              separate context switch to the other thread. We just set a flag
+              so that the other thread will know when it wakes up that it was
+              already processed.
+
+              So put it at the end of the list to be processed in a subsequent
+              iteration of the outer loop.
+            */
+            entry2->queued_by_other= true;
+            last->next_subsequent_commit= waiter;
+            last= waiter;
+            /*
+              As a small optimisation, we do not actually need to set
+              waiter->next_subsequent_commit to NULL, as we can use the
+              pointer `last' to check for end-of-list.
+            */
+          }
+          else
+          {
+            /*
+              Wake up the waiting transaction.
+
+              For this, we need to set the "wakeup running" flag and release
+              the waitee lock to avoid a deadlock, see comments on
+              THD::wakeup_subsequent_commits2() for details.
+            */
+            if (have_lock)
+            {
+              cur->wakeup_subsequent_commits_running= true;
+              mysql_mutex_unlock(&cur->LOCK_wait_commit);
+              have_lock= false;
+            }
+            waiter->wakeup();
+          }
+          waiter= next;
+        } while (waiter);
+      }
+      if (have_lock)
+        mysql_mutex_unlock(&cur->LOCK_wait_commit);
+    }
+    if (cur == last)
+      break;
+    cur= cur->next_subsequent_commit;
+    entry= (group_commit_entry *)cur->opaque_pointer;
+    DBUG_ASSERT(entry != NULL);
+  }
+
+  /* Now we need to clear the wakeup_subsequent_commits_running flags. */
+  if (list)
+  {
+    for (;;)
+    {
+      if (list->wakeup_subsequent_commits_running)
+      {
+        mysql_mutex_lock(&list->LOCK_wait_commit);
+        list->wakeup_subsequent_commits_running= false;
+        mysql_mutex_unlock(&list->LOCK_wait_commit);
+      }
+      if (list == last)
+        break;
+      list= list->next_subsequent_commit;
+    }
   }
+
   mysql_mutex_unlock(&LOCK_prepare_ordered);
   DEBUG_SYNC(entry->thd, "commit_after_release_LOCK_prepare_ordered");
 
+  return orig_queue == NULL;
+}
+
+bool
+MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
+{
+  wait_for_commit *wfc;
+  bool is_leader;
+
+  wfc= entry->thd->wait_for_commit_ptr;
+  entry->queued_by_other= false;
+  if (wfc && wfc->waiting_for_commit)
+  {
+    mysql_mutex_lock(&wfc->LOCK_wait_commit);
+    /* Do an extra check here, this time safely under lock. */
+    if (wfc->waiting_for_commit)
+    {
+      wfc->opaque_pointer= entry;
+      do
+      {
+        mysql_cond_wait(&wfc->COND_wait_commit, &wfc->LOCK_wait_commit);
+      } while (wfc->waiting_for_commit);
+      wfc->opaque_pointer= NULL;
+    }
+    mysql_mutex_unlock(&wfc->LOCK_wait_commit);
+  }
+
+  if (entry->queued_by_other)
+    is_leader= false;
+  else
+    is_leader= queue_for_group_commit(entry, wfc);
+
   /*
     The first in the queue handle group commit for all; the others just wait
     to be signalled when group commit is done.
   */
-  if (orig_queue != NULL)
+  if (is_leader)
+    trx_group_commit_leader(entry);
+  else if (!entry->queued_by_other)
     entry->thd->wait_for_wakeup_ready();
   else
-    trx_group_commit_leader(entry);
+  {
+    /*
+      If we were queued by another prior commit, then we are woken up
+      only when the leader has already completed the commit for us.
+      So nothing to do here then.
+    */
+  }
 
   if (!opt_optimize_thread_scheduling)
   {
     /* For the leader, trx_group_commit_leader() already took the lock. */
-    if (orig_queue != NULL)
+    if (!is_leader)
       mysql_mutex_lock(&LOCK_commit_ordered);
 
     DEBUG_SYNC(entry->thd, "commit_loop_entry_commit_ordered");
@@ -6272,7 +6427,10 @@ MYSQL_BIN_LOG::write_transaction_to_binl
 
     if (next)
     {
-      next->thd->signal_wakeup_ready();
+      if (next->queued_by_other)
+        next->thd->wait_for_commit_ptr->wakeup();
+      else
+        next->thd->signal_wakeup_ready();
     }
     else
     {
@@ -6554,7 +6712,12 @@ MYSQL_BIN_LOG::trx_group_commit_leader(g
     */
     next= current->next;
     if (current != leader)                      // Don't wake up ourself
-      current->thd->signal_wakeup_ready();
+    {
+      if (current->queued_by_other)
+        current->thd->wait_for_commit_ptr->wakeup();
+      else
+        current->thd->signal_wakeup_ready();
+    }
     current= next;
   }
   DEBUG_SYNC(leader->thd, "commit_after_group_run_commit_ordered");
@@ -7143,6 +7306,8 @@ int TC_LOG_MMAP::log_and_order(THD *thd,
     mysql_mutex_unlock(&LOCK_prepare_ordered);
   }
 
+  thd->wait_for_prior_commit();
+
   cookie= 0;
   if (xid)
     cookie= log_one_transaction(xid);

=== modified file 'sql/log.h'
--- sql/log.h	2012-12-05 14:05:37 +0000
+++ sql/log.h	2013-01-15 13:02:54 +0000
@@ -45,6 +45,15 @@ class TC_LOG
 
   virtual int open(const char *opt_name)=0;
   virtual void close()=0;
+  /*
+    Transaction coordinator 2-phase commit.
+
+    Must invoke the run_prepare_ordered and run_commit_ordered methods, as
+    described below for these methods.
+
+    In addition, must invoke THD::wait_for_prior_commit(), or equivalent
+    wait, to ensure that one commit waits for another if registered to do so.
+  */
   virtual int log_and_order(THD *thd, my_xid xid, bool all,
                             bool need_prepare_ordered,
                             bool need_commit_ordered) = 0;
@@ -398,6 +407,7 @@ class MYSQL_QUERY_LOG: public MYSQL_LOG
 void binlog_checkpoint_callback(void *cookie);
 
 class binlog_cache_mngr;
+class wait_for_commit;
 class MYSQL_BIN_LOG: public TC_LOG, private MYSQL_LOG
 {
  private:
@@ -447,6 +457,8 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
       group commit, only used when opt_optimize_thread_scheduling is not set.
     */
     bool check_purge;
+    /* Flag used to optimise around wait_for_prior_commit. */
+    bool queued_by_other;
     ulong binlog_id;
   };
 
@@ -551,6 +563,7 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
   void do_checkpoint_request(ulong binlog_id);
   void purge();
   int write_transaction_or_stmt(group_commit_entry *entry);
+  bool queue_for_group_commit(group_commit_entry *entry, wait_for_commit *wfc);
   bool write_transaction_to_binlog_events(group_commit_entry *entry);
   void trx_group_commit_leader(group_commit_entry *leader);
   void mark_xid_done(ulong cookie, bool write_checkpoint);

=== modified file 'sql/log_event.cc'
--- sql/log_event.cc	2012-12-05 14:05:37 +0000
+++ sql/log_event.cc	2013-01-16 10:25:50 +0000
@@ -10042,36 +10042,71 @@ Rows_log_event::get_pk_value(Relay_log_i
         hash_item_t* item= new hash_item_t;
         item->key_len= 0;
  
-        for (i= 0; i < m_table->s->key_info[j].key_parts; i++)
+        /*
+          With in-order commit, we need to be sure that we will not try to run
+          in parallel two transactions that can conflict on any locks on data
+          in the tables. Because that could cause a deadlock, T2 waits for T1
+          to commit while T1 waits for T2 to release its locks after commit.
+
+          With InnoDB, we disable foreign key checks and check that there are
+          no overlapping primary key or unique key values, since InnoDB is
+          based on row locks this (hopefully) guarantees against row lock
+          conflicts.
+
+          But MyISAM for example uses table level locks, so it is entirely
+          possible for T2 to have a lock on the table that prevents T1 from
+          even starting, thus causing a deadlock.
+
+          We handle this partially here. If an engine is non-transactional,
+          then we omit the key values from the hash key (leaving thus the table
+          name). Thus, we will never try to do two MyISAM updates to the same
+          table in parallel (such would in any case usually block one another
+          due to table locks). Updates to different tables at the same time
+          are ok.
+
+          ToDo:
+
+          Note that this is insufficient. One could imagine a transactional
+          engine with table-level locks. Or even page-level locking or lock
+          escalation would be sufficient to cause deadlocks. So just checking
+          for a transactional engine is not enough. One possible solution would
+          be a new flag where an engine could announce a guarantee that
+          parallel updates can not conflict each other on locks if there are
+          no overlapping primary key or unique key values.
+        */
+        if (m_table->file->has_transactions())
         {
-          field_index= m_table->s->key_info[j].key_part[i].field->field_index;
+          for (i= 0; i < m_table->s->key_info[j].key_parts; i++)
+          {
+            field_index= m_table->s->key_info[j].key_part[i].field->field_index;
 
-          if (m_table->field[field_index]->type() == MYSQL_TYPE_VARCHAR)
-          {     
-            uint32 length_byte= ((Field_varstring*)m_table->field[field_index])->length_bytes;
-            if (length_byte == 1) 
-              pack_length= m_table->field[field_index]->ptr[0] + 1;
+            if (m_table->field[field_index]->type() == MYSQL_TYPE_VARCHAR)
+            {     
+              uint32 length_byte= ((Field_varstring*)m_table->field[field_index])->length_bytes;
+              if (length_byte == 1) 
+                pack_length= m_table->field[field_index]->ptr[0] + 1;
+              else  
+                pack_length= uint2korr(m_table->field[field_index]->ptr) + 2;
+            }     
             else  
-              pack_length= uint2korr(m_table->field[field_index]->ptr) + 2;
-          }     
-          else  
-          {     
-            pack_length= m_table->field[field_index]->pack_length();
-          }  
+            {     
+              pack_length= m_table->field[field_index]->pack_length();
+            }  
 
-          if (item->key_len + pack_length >= 1024)
-            break;
+            if (item->key_len + pack_length >= 1024)
+              break;
 
-          memcpy(&item->key[item->key_len], m_table->field[field_index]->ptr, pack_length);
+            memcpy(&item->key[item->key_len], m_table->field[field_index]->ptr, pack_length);
 
-          if (!(m_table->field[field_index]->flags & BINARY_FLAG))
-          {
-            string_tolower(&item->key[item->key_len], pack_length);
+            if (!(m_table->field[field_index]->flags & BINARY_FLAG))
+            {
+              string_tolower(&item->key[item->key_len], pack_length);
+            }
+
+            item->key_len+= pack_length;
+
+            //fprintf(stderr, "%s:%d table_name %s  field_index %d field_value %d left %d pack_length :%d \n", __FILE__, __LINE__, m_table->s->table_name.str, field_index, *(int *)m_table->field[field_index]->ptr, m_rows_end-m_curr_row, pack_length);
           }
- 
-          item->key_len+= pack_length;
-          
-          //fprintf(stderr, "%s:%d table_name %s  field_index %d field_value %d left %d pack_length :%d \n", __FILE__, __LINE__, m_table->s->table_name.str, field_index, *(int *)m_table->field[field_index]->ptr, m_rows_end-m_curr_row, pack_length);
         }
          
         item->key_len+= sprintf(&item->key[item->key_len], "\1%s\1%s", m_table->s->db.str, m_table->s->table_name.str);

=== modified file 'sql/mysqld.cc'
--- sql/mysqld.cc	2012-12-05 14:05:37 +0000
+++ sql/mysqld.cc	2013-01-15 13:37:43 +0000
@@ -742,7 +742,7 @@ PSI_mutex_key key_BINLOG_LOCK_index, key
   key_master_info_sleep_lock,
   key_mutex_slave_reporting_capability_err_lock, key_relay_log_info_data_lock,
   key_relay_log_info_log_space_lock, key_relay_log_info_run_lock,
-  key_relay_log_info_sleep_lock,
+  key_relay_log_info_sleep_lock, key_rli_last_committed_id,
   key_structure_guard_mutex, key_TABLE_SHARE_LOCK_ha_data,
   key_LOCK_error_messages, key_LOG_INFO_lock, key_LOCK_thread_count,
   key_PARTITION_LOCK_auto_inc;
@@ -751,7 +751,7 @@ PSI_mutex_key key_RELAYLOG_LOCK_index;
 PSI_mutex_key key_LOCK_stats,
   key_LOCK_global_user_client_stats, key_LOCK_global_table_stats,
   key_LOCK_global_index_stats,
-  key_LOCK_wakeup_ready;
+  key_LOCK_wakeup_ready, key_LOCK_wait_commit;
 
 PSI_mutex_key key_LOCK_prepare_ordered, key_LOCK_commit_ordered;
 
@@ -795,6 +795,7 @@ static PSI_mutex_info all_server_mutexes
   { &key_LOCK_global_table_stats, "LOCK_global_table_stats", PSI_FLAG_GLOBAL},
   { &key_LOCK_global_index_stats, "LOCK_global_index_stats", PSI_FLAG_GLOBAL},
   { &key_LOCK_wakeup_ready, "THD::LOCK_wakeup_ready", 0},
+  { &key_LOCK_wait_commit, "wait_for_commit::LOCK_wait_commit", 0},
   { &key_LOCK_thd_data, "THD::LOCK_thd_data", 0},
   { &key_LOCK_user_conn, "LOCK_user_conn", PSI_FLAG_GLOBAL},
   { &key_LOCK_uuid_short_generator, "LOCK_uuid_short_generator", PSI_FLAG_GLOBAL},
@@ -807,6 +808,7 @@ static PSI_mutex_info all_server_mutexes
   { &key_relay_log_info_log_space_lock, "Relay_log_info::log_space_lock", 0},
   { &key_relay_log_info_run_lock, "Relay_log_info::run_lock", 0},
   { &key_relay_log_info_sleep_lock, "Relay_log_info::sleep_lock", 0},
+  { &key_rli_last_committed_id, "Relay_log_info::LOCK_last_committed_id", 0},
   { &key_structure_guard_mutex, "Query_cache::structure_guard_mutex", 0},
   { &key_TABLE_SHARE_LOCK_ha_data, "TABLE_SHARE::LOCK_ha_data", 0},
   { &key_LOCK_error_messages, "LOCK_error_messages", PSI_FLAG_GLOBAL},
@@ -851,7 +853,8 @@ PSI_cond_key key_BINLOG_COND_xid_list, k
   key_TABLE_SHARE_cond, key_user_level_lock_cond,
   key_COND_thread_count, key_COND_thread_cache, key_COND_flush_thread_cache,
   key_BINLOG_COND_queue_busy;
-PSI_cond_key key_RELAYLOG_update_cond, key_COND_wakeup_ready;
+PSI_cond_key key_RELAYLOG_update_cond, key_COND_wakeup_ready,
+  key_COND_wait_commit;
 PSI_cond_key key_RELAYLOG_COND_queue_busy;
 PSI_cond_key key_TC_LOG_MMAP_COND_queue_busy;
 
@@ -872,6 +875,7 @@ static PSI_cond_info all_server_conds[]=
   { &key_RELAYLOG_update_cond, "MYSQL_RELAY_LOG::update_cond", 0},
   { &key_RELAYLOG_COND_queue_busy, "MYSQL_RELAY_LOG::COND_queue_busy", 0},
   { &key_COND_wakeup_ready, "THD::COND_wakeup_ready", 0},
+  { &key_COND_wait_commit, "wait_for_commit::COND_wait_commit", 0},
   { &key_COND_cache_status_changed, "Query_cache::COND_cache_status_changed", 0},
   { &key_COND_manager, "COND_manager", PSI_FLAG_GLOBAL},
   { &key_COND_rpl_status, "COND_rpl_status", PSI_FLAG_GLOBAL},

=== modified file 'sql/mysqld.h'
--- sql/mysqld.h	2012-12-05 14:05:37 +0000
+++ sql/mysqld.h	2013-01-15 13:02:54 +0000
@@ -243,14 +243,14 @@ extern PSI_mutex_key key_BINLOG_LOCK_ind
   key_master_info_sleep_lock,
   key_mutex_slave_reporting_capability_err_lock, key_relay_log_info_data_lock,
   key_relay_log_info_log_space_lock, key_relay_log_info_run_lock,
-  key_relay_log_info_sleep_lock,
+  key_relay_log_info_sleep_lock, key_rli_last_committed_id,
   key_structure_guard_mutex, key_TABLE_SHARE_LOCK_ha_data,
   key_LOCK_error_messages, key_LOCK_thread_count, key_PARTITION_LOCK_auto_inc;
 extern PSI_mutex_key key_RELAYLOG_LOCK_index;
 
 extern PSI_mutex_key key_LOCK_stats,
   key_LOCK_global_user_client_stats, key_LOCK_global_table_stats,
-  key_LOCK_global_index_stats, key_LOCK_wakeup_ready;
+  key_LOCK_global_index_stats, key_LOCK_wakeup_ready, key_LOCK_wait_commit;
 
 extern PSI_rwlock_key key_rwlock_LOCK_grant, key_rwlock_LOCK_logger,
   key_rwlock_LOCK_sys_init_connect, key_rwlock_LOCK_sys_init_slave,
@@ -272,7 +272,8 @@ extern PSI_cond_key key_BINLOG_COND_xid_
   key_relay_log_info_sleep_cond,
   key_TABLE_SHARE_cond, key_user_level_lock_cond,
   key_COND_thread_count, key_COND_thread_cache, key_COND_flush_thread_cache;
-extern PSI_cond_key key_RELAYLOG_update_cond, key_COND_wakeup_ready;
+extern PSI_cond_key key_RELAYLOG_update_cond, key_COND_wakeup_ready,
+  key_COND_wait_commit;
 extern PSI_cond_key key_RELAYLOG_COND_queue_busy;
 extern PSI_cond_key key_TC_LOG_MMAP_COND_queue_busy;
 

=== modified file 'sql/rpl_rli.cc'
--- sql/rpl_rli.cc	2012-12-05 14:05:37 +0000
+++ sql/rpl_rli.cc	2013-01-15 13:02:54 +0000
@@ -43,6 +43,7 @@ Relay_log_info::Relay_log_info(bool is_s
    sync_counter(0), is_relay_log_recovery(is_slave_recovery),
    save_temporary_tables(0), cur_log_old_open_count(0), group_relay_log_pos(0), 
    event_relay_log_pos(0),
+   last_trans_id(0), last_trans(0), last_committed_id(0),
 #if HAVE_valgrind
    is_fake(FALSE),
 #endif
@@ -78,6 +79,8 @@ Relay_log_info::Relay_log_info(bool is_s
   mysql_mutex_init(key_relay_log_info_log_space_lock,
                    &log_space_lock, MY_MUTEX_INIT_FAST);
   mysql_mutex_init(key_relay_log_info_sleep_lock, &sleep_lock, MY_MUTEX_INIT_FAST);
+  mysql_mutex_init(key_rli_last_committed_id, &LOCK_last_committed_id,
+                   MY_MUTEX_INIT_SLOW);
   mysql_cond_init(key_relay_log_info_data_cond, &data_cond, NULL);
   mysql_cond_init(key_relay_log_info_start_cond, &start_cond, NULL);
   mysql_cond_init(key_relay_log_info_stop_cond, &stop_cond, NULL);
@@ -102,6 +105,7 @@ Relay_log_info::~Relay_log_info()
   mysql_mutex_destroy(&data_lock);
   mysql_mutex_destroy(&log_space_lock);
   mysql_mutex_destroy(&sleep_lock);
+  mysql_mutex_destroy(&LOCK_last_committed_id);
   mysql_cond_destroy(&data_cond);
   mysql_cond_destroy(&start_cond);
   mysql_cond_destroy(&stop_cond);

=== modified file 'sql/rpl_rli.h'
--- sql/rpl_rli.h	2012-12-05 14:05:37 +0000
+++ sql/rpl_rli.h	2013-01-15 13:37:43 +0000
@@ -106,6 +106,25 @@ class transaction_st{
   my_off_t relay_log_pos;
   trans_pos_t *trans_pos;
 
+  /* This is used to keep transaction commit order. */
+  wait_for_commit commit_orderer;
+  /*
+    This is the ID (Relay_log_info::last_trans_id) of the previous transaction,
+    that we want to wait for before committing ourselves (if ordered commits
+    are enforced).
+  */
+  uint64 wait_commit_id;
+  /*
+    This is the transaction whose commit we want to wait for.
+    Only valid if Relay_log_info::last_committed_id < wait_commit_id.
+  */
+  transaction_st *wait_commit_trans;
+  /*
+    This is our own transaction id, which we should update
+    Relay_log_info::last_committed_id to once we commit.
+  */
+  uint64 own_commit_id;
+
   transaction_st();
 
   ~transaction_st();
@@ -123,7 +142,7 @@ class Transfer_worker
   int start();
   int stop();
   int wait_for_stopped();
-  int push_trans(transaction_st *trans);
+  int push_trans(Relay_log_info *rli, transaction_st *trans);
   int pop_trans(int batch_trans_n);
   int remove_trans(transaction_st *trans);
   bool check_trans_conflict(transaction_st *trans);
@@ -294,6 +313,20 @@ class Relay_log_info : public Slave_repo
   int push_back_trans_pos(transaction_st *trans);
   int rollback_trans_pos(transaction_st *trans);
   int pop_front_trans_pos();
+
+  /* Running counter for assigning IDs to event groups/transactions. */
+  uint64 last_trans_id;
+  /* The transaction_st corresponding to last_trans_id. */
+  transaction_st *last_trans;
+  /*
+    The ID of the last transaction/event group that was committed/applied.
+    This is used to decide if the next transaction should wait for the
+    previous one to commit (to avoid trying to wait for a commit that already
+    took place).
+  */
+  uint64 last_committed_id;
+  /* Mutex protecting access to last_committed_id. */
+  mysql_mutex_t LOCK_last_committed_id;
   /*Transfer end*/
 
 

=== modified file 'sql/slave.cc'
--- sql/slave.cc	2013-01-08 14:12:14 +0000
+++ sql/slave.cc	2013-01-15 13:02:54 +0000
@@ -2857,7 +2857,64 @@ int execute_single_transaction(Relay_log
 
 int Transfer_worker::execute_transaction(transaction_st *trans)
 {
-  return execute_single_transaction(dummy_rli, trans);
+  int res;
+
+  mysql_mutex_lock(&rli->LOCK_last_committed_id);
+  /*
+    Register us to wait for the previous commit, unless that commit is
+    already finished.
+  */
+  if (trans->wait_commit_id > rli->last_committed_id)
+  {
+    trans->commit_orderer.register_wait_for_prior_commit
+      (&trans->wait_commit_trans->commit_orderer);
+  }
+  mysql_mutex_unlock(&rli->LOCK_last_committed_id);
+
+  DBUG_ASSERT(!thd->wait_for_commit_ptr);
+  thd->wait_for_commit_ptr= &trans->commit_orderer;
+
+  res= execute_single_transaction(dummy_rli, trans);
+
+  /*
+    It is important to not leave us dangling in the wait-for list of another
+    THD. Best would be to ensure that we never register to wait without
+    actually waiting. But it's cheap, and probably more robust, to do an extra
+    check here and remove our wait registration if we somehow ended up never
+    waiting because of error condition or something.
+
+    ToDo: We need to *wait* here, not unregister. Because we must not wake
+    up following transactions until all prior transactions have completed.
+
+    If we do not want to wait, then alternatively we must put the
+    transaction_st * trans into some pending list, where it can be "woken up"
+    asynchronously when the prior transaction _does_ commit.
+  */
+  trans->commit_orderer.unregister_wait_for_prior_commit();
+
+  thd->wait_for_commit_ptr= NULL;
+
+  /*
+    Register our commit so that subsequent transactions/event groups will know
+    not to register to wait for us any more.
+
+    We can race here with the next transactions, but that is fine, as long as
+    we check that we do not decrease last_committed_id. If this commit is done,
+    then any prior commits will also have been done and also no longer need
+    waiting for.
+  */
+  mysql_mutex_lock(&rli->LOCK_last_committed_id);
+  if (rli->last_committed_id < trans->own_commit_id)
+    rli->last_committed_id= trans->own_commit_id;
+  mysql_mutex_unlock(&rli->LOCK_last_committed_id);
+
+  /*
+    Now that we have marked in rli->last_committed_id that we have committed,
+    no more waiter can register. So wake up any pending one last time.
+  */
+  trans->commit_orderer.wakeup_subsequent_commits();
+
+  return res;
 }
 
 int transfer_event_types[] = {TABLE_MAP_EVENT, WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, DELETE_ROWS_EVENT, QUERY_EVENT, XID_EVENT};
@@ -3038,8 +3095,11 @@ int Transfer_worker::wait_for_stopped()
 }
 
 /* return -1 means the worker is full. ok is 0 */
-int Transfer_worker::push_trans(transaction_st *trans)
+int Transfer_worker::push_trans(Relay_log_info *rli, transaction_st *trans)
 {
+
+  uint64 prev_trans_id, this_trans_id;
+
   rw_wrlock(&trans_list_lock);
 
   if (waiting_trans_number == 0)
@@ -3056,6 +3116,16 @@ int Transfer_worker::push_trans(transact
     list_end= (list_end + 1) % worker_size;
   }
 
+  prev_trans_id= rli->last_trans_id;
+  this_trans_id= prev_trans_id + 1;
+  rli->last_trans_id= this_trans_id;
+
+  trans->own_commit_id= this_trans_id;
+  trans->wait_commit_id= prev_trans_id;
+  trans->wait_commit_trans= rli->last_trans;
+
+  rli->last_trans= trans;
+
   pk_hash_plus(trans);
 
   trans_list[list_end]= trans;
@@ -3281,7 +3351,7 @@ int dispatch_transaction(Relay_log_info
     goto retry;
   }
 
-  if (rli->workers[trans->worker_id]->push_trans(trans) != 0)
+  if (rli->workers[trans->worker_id]->push_trans(rli, trans) != 0)
   {
     rli->rollback_trans_pos(trans);
     my_sleep(1000);

=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2012-12-05 14:05:37 +0000
+++ sql/sql_class.cc	2013-01-15 13:02:54 +0000
@@ -574,6 +574,17 @@ void thd_set_ha_data(THD *thd, const str
 }
 
 
+/**
+  Allow storage engine to wakeup commits waiting in THD::wait_for_prior_commit.
+  @see thd_wakeup_subsequent_commits() definition in plugin.h
+*/
+extern "C"
+void thd_wakeup_subsequent_commits(THD *thd)
+{
+  thd->wakeup_subsequent_commits();
+}
+
+
 extern "C"
 long long thd_test_options(const THD *thd, long long test_options)
 {
@@ -754,6 +765,7 @@ THD::THD()
 #if defined(ENABLED_DEBUG_SYNC)
    debug_sync_control(0),
 #endif /* defined(ENABLED_DEBUG_SYNC) */
+   wait_for_commit_ptr(0),
    main_warning_info(0, false)
 {
   ulong tmp;
@@ -5505,6 +5517,202 @@ THD::signal_wakeup_ready()
 }
 
 
+wait_for_commit::wait_for_commit()
+  : subsequent_commits_list(0), next_subsequent_commit(0), waitee(0),
+    opaque_pointer(0),
+    waiting_for_commit(false), wakeup_subsequent_commits_running(false)
+{
+  mysql_mutex_init(key_LOCK_wait_commit, &LOCK_wait_commit, MY_MUTEX_INIT_FAST);
+  mysql_cond_init(key_COND_wait_commit, &COND_wait_commit, 0);
+}
+
+
+void
+wait_for_commit::wakeup()
+{
+  /*
+    We signal each waiter on their own condition and mutex (rather than using
+    pthread_cond_broadcast() or something like that).
+
+    Otherwise we would need to somehow ensure that they were done
+    waking up before we could allow this THD to be destroyed, which would
+    be annoying and unnecessary.
+  */
+  mysql_mutex_lock(&LOCK_wait_commit);
+  waiting_for_commit= false;
+  mysql_cond_signal(&COND_wait_commit);
+  mysql_mutex_unlock(&LOCK_wait_commit);
+}
+
+
+/*
+  Register that the next commit of this THD should wait to complete until
+  commit in another THD (the waitee) has completed.
+
+  The wait may occur explicitly, with the waiter sitting in
+  wait_for_prior_commit() until the waitee calls wakeup_subsequent_commits().
+
+  Alternatively, the TC (eg. binlog) may do the commits of both waitee and
+  waiter at once during group commit, resolving both of them in the right
+  order.
+
+  Only one waitee can be registered for a waiter; it must be removed by
+  wait_for_prior_commit() or unregister_wait_for_prior_commit() before a new
+  one is registered. But it is ok for several waiters to register a wait for
+  the same waitee. It is also permissible for one THD to be both a waiter and
+  a waitee at the same time.
+*/
+void
+wait_for_commit::register_wait_for_prior_commit(wait_for_commit *waitee)
+{
+  waiting_for_commit= true;
+  DBUG_ASSERT(!this->waitee /* No prior registration allowed */);
+  this->waitee= waitee;
+
+  mysql_mutex_lock(&waitee->LOCK_wait_commit);
+  /*
+    If waitee is in the middle of wakeup, then there is nothing to wait for,
+    so we need not register. This is necessary to avoid a race in unregister,
+    see comments on wakeup_subsequent_commits2() for details.
+  */
+  if (waitee->wakeup_subsequent_commits_running)
+    waiting_for_commit= false;
+  else
+  {
+    this->next_subsequent_commit= waitee->subsequent_commits_list;
+    waitee->subsequent_commits_list= this;
+  }
+  mysql_mutex_unlock(&waitee->LOCK_wait_commit);
+}
+
+
+/*
+  Wait for commit of another transaction to complete, as already registered
+  with register_wait_for_prior_commit(). If the commit already completed,
+  returns immediately.
+*/
+void
+wait_for_commit::wait_for_prior_commit2()
+{
+  mysql_mutex_lock(&LOCK_wait_commit);
+  while (waiting_for_commit)
+    mysql_cond_wait(&COND_wait_commit, &LOCK_wait_commit);
+  mysql_mutex_unlock(&LOCK_wait_commit);
+  waitee= NULL;
+}
+
+
+/*
+  Wakeup anyone waiting for us to have committed.
+
+  Note about locking:
+
+  We have a potential race or deadlock between wakeup_subsequent_commits() in
+  the waitee and unregister_wait_for_prior_commit() in the waiter.
+
+  Both waiter and waitee need to take their own lock before it is safe to take
+  a lock on the other party - else the other party might disappear and invalid
+  memory data could be accessed. But if we take the two locks in different
+  order, we may end up in a deadlock.
+
+  The waiter needs to lock the waitee to delete itself from the list in
+  unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can not
+  hold its own lock while locking waiters, lest we deadlock.
+
+  So we need to prevent unregister_wait_for_prior_commit() running while wakeup
+  is in progress - otherwise the unregister could complete before the wakeup,
+  leading to incorrect spurious wakeup or accessing invalid memory.
+
+  However, if we are in the middle of running wakeup_subsequent_commits(), then
+  there is no need for unregister_wait_for_prior_commit() in the first place -
+  the waiter can just do a normal wait_for_prior_commit(), as it will be
+  immediately woken up.
+
+  So the solution to the potential race/deadlock is to set a flag in the waitee
+  that wakeup_subsequent_commits() is in progress. When this flag is set,
+  unregister_wait_for_prior_commit() becomes just wait_for_prior_commit().
+
+  Then also register_wait_for_prior_commit() needs to check if
+  wakeup_subsequent_commits() is running, and skip the registration if
+  so. This is needed in case a new waiter manages to register itself and
+  immediately try to unregister while wakeup_subsequent_commits() is
+  running. Else the new waiter would also wait rather than unregister, but it
+  would not be woken up until next wakeup, which could be potentially much
+  later than necessary.
+*/
+void
+wait_for_commit::wakeup_subsequent_commits2()
+{
+  wait_for_commit *waiter;
+
+  mysql_mutex_lock(&LOCK_wait_commit);
+  wakeup_subsequent_commits_running= true;
+  waiter= subsequent_commits_list;
+  subsequent_commits_list= NULL;
+  mysql_mutex_unlock(&LOCK_wait_commit);
+
+  while (waiter)
+  {
+    /*
+      Important: we must grab the next pointer before waking up the waiter;
+      once the wakeup is done, the field could be invalidated at any time.
+    */
+    wait_for_commit *next= waiter->next_subsequent_commit;
+    waiter->wakeup();
+    waiter= next;
+  }
+
+  mysql_mutex_lock(&LOCK_wait_commit);
+  wakeup_subsequent_commits_running= false;
+  mysql_mutex_unlock(&LOCK_wait_commit);
+}
+
+
+/* Cancel a previously registered wait for another THD to commit before us. */
+void
+wait_for_commit::unregister_wait_for_prior_commit2()
+{
+  mysql_mutex_lock(&LOCK_wait_commit);
+  if (waiting_for_commit)
+  {
+    wait_for_commit *loc_waitee= this->waitee;
+    wait_for_commit **next_ptr_ptr, *cur;
+    mysql_mutex_lock(&loc_waitee->LOCK_wait_commit);
+    if (loc_waitee->wakeup_subsequent_commits_running)
+    {
+      /*
+        When a wakeup is running, we cannot safely remove ourselves from the
+        list without corrupting it. Instead we can just wait, as wakeup is
+        already in progress and will thus be immediate.
+
+        See comments on wakeup_subsequent_commits2() for more details.
+      */
+      mysql_mutex_unlock(&loc_waitee->LOCK_wait_commit);
+      while (waiting_for_commit)
+        mysql_cond_wait(&COND_wait_commit, &LOCK_wait_commit);
+    }
+    else
+    {
+      /* Remove ourselves from the list in the waitee. */
+      next_ptr_ptr= &loc_waitee->subsequent_commits_list;
+      while ((cur= *next_ptr_ptr) != NULL)
+      {
+        if (cur == this)
+        {
+          *next_ptr_ptr= this->next_subsequent_commit;
+          break;
+        }
+        next_ptr_ptr= &cur->next_subsequent_commit;
+      }
+      waiting_for_commit= false;
+      mysql_mutex_unlock(&loc_waitee->LOCK_wait_commit);
+    }
+  }
+  mysql_mutex_unlock(&LOCK_wait_commit);
+  this->waitee= NULL;
+}
+
+
 bool Discrete_intervals_list::append(ulonglong start, ulonglong val,
                                  ulonglong incr)
 {

=== modified file 'sql/sql_class.h'
--- sql/sql_class.h	2012-09-22 14:11:40 +0000
+++ sql/sql_class.h	2013-01-15 13:02:54 +0000
@@ -1518,6 +1518,115 @@ class Global_read_lock
 };
 
 
+/*
+  Class to facilitate the commit of one transaction waiting for the commit of
+  another transaction to complete first.
+
+  This is used during (parallel) replication, to allow different transactions
+  to be applied in parallel, but still commit in order.
+
+  The transaction that wants to wait for a prior commit must first register
+  to wait with register_wait_for_prior_commit(waitee). Such registration
+  must be done holding the waitee->LOCK_wait_commit, to prevent the other
+  THD from disappearing during the registration.
+
+  Then during commit, if a THD is registered to wait, it will call
+  wait_for_prior_commit() as part of ha_commit_trans(). If no wait is
+  registered, or if the waitee has already completed commit, then
+  wait_for_prior_commit() returns immediately.
+
+  And when a THD that may be waited for has completed commit (more precisely
+  commit_ordered()), then it must call wakeup_subsequent_commits() to wake
+  up any waiters. Note that this must be done at a point that is guaranteed
+  to be later than any waiters registering themselves. It is safe to call
+  wakeup_subsequent_commits() multiple times, as waiters are removed from
+  registration as part of the wakeup.
+
+  The reason for separate register and wait calls is that this allows the
+  wait to be registered early, at a point where the waited-for THD is known
+  to exist. The actual wait can then be done much later, when the
+  waited-for THD may be long gone. By registering early, the waitee
+  can signal before disappearing.
+*/
+struct wait_for_commit
+{
+  /*
+    The LOCK_wait_commit protects the fields subsequent_commits_list and
+    wakeup_subsequent_commits_running (for a waitee), and the flag
+    waiting_for_commit and associated COND_wait_commit (for a waiter).
+  */
+  mysql_mutex_t LOCK_wait_commit;
+  mysql_cond_t COND_wait_commit;
+  /* List of threads that did register_wait_for_prior_commit() on us. */
+  wait_for_commit *subsequent_commits_list;
+  /* Link field for entries in subsequent_commits_list. */
+  wait_for_commit *next_subsequent_commit;
+  /* Our waitee, if we did register_wait_for_prior_commit(), else NULL. */
+  wait_for_commit *waitee;
+  /*
+    Generic pointer for use by the transaction coordinator to optimise the
+    waiting for improved group commit.
+
+    Currently used by binlog TC to signal that a waiter is ready to commit, so
+    that the waitee can grab it and group commit it directly. It is free to be
+    used by another transaction coordinator for similar purposes.
+  */
+  void *opaque_pointer;
+  /*
+    The waiting_for_commit flag is cleared when a waiter has been woken
+    up. The COND_wait_commit condition is signalled when this has been
+    cleared.
+  */
+  bool waiting_for_commit;
+  /*
+    Flag set while wakeup_subsequent_commits2() is running, see comments
+    on that function for details.
+  */
+  bool wakeup_subsequent_commits_running;
+
+  void register_wait_for_prior_commit(wait_for_commit *waitee);
+  void wait_for_prior_commit()
+  {
+    /*
+      Quick inline check, to avoid function call and locking in the common case
+      where no wait is registered, or a registered wait was already signalled.
+    */
+    if (waiting_for_commit)
+      wait_for_prior_commit2();
+  }
+  void wakeup_subsequent_commits()
+  {
+    /*
+      Do the check inline, so only the wakeup case takes the cost of a function
+      call for every commit.
+
+      Note that the check is done without locking. It is the responsibility of
+      the user of the wakeup facility to ensure that no waiters can register
+      themselves after the last call to wakeup_subsequent_commits().
+
+      This avoids having to take another lock for every commit, which would be
+      pointless anyway - even if we check under lock, there is nothing to
+      prevent a waiter from arriving just after releasing the lock.
+    */
+    if (subsequent_commits_list)
+      wakeup_subsequent_commits2();
+  }
+  void unregister_wait_for_prior_commit()
+  {
+    if (waiting_for_commit)
+      unregister_wait_for_prior_commit2();
+  }
+
+  void wakeup();
+
+  void wait_for_prior_commit2();
+  void wakeup_subsequent_commits2();
+  void unregister_wait_for_prior_commit2();
+
+  wait_for_commit();
+};
+
+
 extern "C" void my_message_sql(uint error, const char *str, myf MyFlags);
 
 /**
@@ -3095,6 +3204,19 @@ class THD :public Statement,
   void wait_for_wakeup_ready();
   /* Wake this thread up from wait_for_wakeup_ready(). */
   void signal_wakeup_ready();
+
+  wait_for_commit *wait_for_commit_ptr;
+  void wait_for_prior_commit()
+  {
+    if (wait_for_commit_ptr)
+      wait_for_commit_ptr->wait_for_prior_commit();
+  }
+  void wakeup_subsequent_commits()
+  {
+    if (wait_for_commit_ptr)
+      wait_for_commit_ptr->wakeup_subsequent_commits();
+  }
+
 private:
 
   /** The current internal error handler for this thread, or NULL. */

=== modified file 'storage/innobase/handler/ha_innodb.cc'
--- storage/innobase/handler/ha_innodb.cc	2012-09-22 14:11:40 +0000
+++ storage/innobase/handler/ha_innodb.cc	2013-01-15 13:02:54 +0000
@@ -2902,6 +2902,11 @@ innobase_commit(
 		/* We were instructed to commit the whole transaction, or
 		this is an SQL statement end and autocommit is on */
 
+		/* At this point commit order is fixed and transaction is
+		visible to others. So we can wakeup other commits waiting for
+		this one, to allow them to group commit with us. */
+		thd_wakeup_subsequent_commits(thd);
+
 		/* We did the first part already in innobase_commit_ordered(),
 		Now finish by doing a write + flush of logs. */
 		trx_commit_complete_for_mysql(trx);

=== modified file 'storage/xtradb/handler/ha_innodb.cc'
--- storage/xtradb/handler/ha_innodb.cc	2012-09-22 14:11:40 +0000
+++ storage/xtradb/handler/ha_innodb.cc	2013-01-15 13:02:54 +0000
@@ -3414,6 +3414,11 @@ innobase_commit(
 		/* We were instructed to commit the whole transaction, or
 		this is an SQL statement end and autocommit is on */
 
+		/* At this point commit order is fixed and transaction is
+		visible to others. So we can wakeup other commits waiting for
+		this one, to allow them to group commit with us. */
+		thd_wakeup_subsequent_commits(thd);
+
 		/* We did the first part already in innobase_commit_ordered(),
 		Now finish by doing a write + flush of logs. */
 		trx_commit_complete_for_mysql(trx);
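
For anyone who wants to play with the register / wait / wakeup protocol
outside the server, here is a minimal standalone sketch in C++11. It is not
the code from the patch: CommitOrderer and run_in_order() are hypothetical
names, std::mutex and std::condition_variable stand in for mysql_mutex_t and
mysql_cond_t, and all registration is done up front by a single thread. That
last assumption means the wakeup_subsequent_commits_running flag and the
unregister path are not needed here, so they are left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Simplified, hypothetical stand-in for the patch's wait_for_commit.
struct CommitOrderer {
  std::mutex lock;                       // protects waiters and waiting
  std::condition_variable cond;          // signalled when waiting is cleared
  CommitOrderer *waiters = nullptr;      // like subsequent_commits_list
  CommitOrderer *next_waiter = nullptr;  // like next_subsequent_commit
  bool waiting = false;                  // like waiting_for_commit

  // Register to wait for waitee. Assumption: only called before any
  // worker thread has started, so it cannot race with a wakeup.
  void register_wait(CommitOrderer *waitee) {
    assert(!waiting);  // only one waitee per waiter
    waiting = true;
    std::lock_guard<std::mutex> g(waitee->lock);
    next_waiter = waitee->waiters;
    waitee->waiters = this;
  }

  // Block until the registered waitee has woken us (returns at once if
  // nothing was registered).
  void wait_for_prior_commit() {
    std::unique_lock<std::mutex> g(lock);
    cond.wait(g, [this] { return !waiting; });
  }

  // Wake every waiter, each on its own mutex and condition.
  void wakeup_subsequent_commits() {
    CommitOrderer *w;
    {
      std::lock_guard<std::mutex> g(lock);
      w = waiters;
      waiters = nullptr;
    }
    while (w) {
      // Grab the link before waking; the waiter may move on right after.
      CommitOrderer *next = w->next_waiter;
      {
        std::lock_guard<std::mutex> g(w->lock);
        w->waiting = false;
        w->cond.notify_one();
      }
      w = next;
    }
  }
};

// Run n "transactions" in parallel threads and return the order in which
// they committed. Each one waits for its predecessor before committing.
std::vector<int> run_in_order(int n) {
  std::vector<CommitOrderer> trans(n);
  for (int i = 1; i < n; i++)
    trans[i].register_wait(&trans[i - 1]);

  std::vector<int> commit_order;
  std::mutex order_lock;
  std::vector<std::thread> workers;
  // Start the workers in reverse, so later transactions tend to be ready
  // first and really have to wait.
  for (int i = n - 1; i >= 0; i--) {
    workers.emplace_back([&, i] {
      trans[i].wait_for_prior_commit();
      {
        std::lock_guard<std::mutex> g(order_lock);
        commit_order.push_back(i);  // the "commit"
      }
      trans[i].wakeup_subsequent_commits();
    });
  }
  for (auto &t : workers)
    t.join();
  return commit_order;
}
```

Whatever the thread scheduling, run_in_order(8) should hand back the ids in
ascending order, since a worker only records its commit after its
predecessor has committed and woken it up.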

