maria-developers team mailing list archive

Re: 答复: in-order commit

To: 丁奇 <dingqi.lxb@xxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Fri, 11 Jan 2013 11:44:49 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <D264FECF3AFEE04D96AA18E5760D9C95022293@CNHZ-EXMAIL-08.ali.com> ("丁奇"'s message of "Fri, 11 Jan 2013 03:46:04 +0000")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

丁奇 <dingqi.lxb@xxxxxxxxxx> writes:

> Hi, Kristian
>     Ok. I have got the information from JIRA.
>
>    I find you control the commit order inside the user thread.
>
> Will it be easier to let Trans_worker thread hold this logic?

Yes, I think you are right. Of course, the user thread is the one that knows
the ordering, but the logic for waiting needs to be in the Trans_worker
thread. In fact this is a bug in my first patch: Transaction T3 could wait for
the THD of worker thread 1 which has both T1 and T2 queued; then it will wake
up too early, when T1 commits rather than when T2 does.
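
To make the early wakeup concrete, here is a minimal sketch in Python. It is
only a toy model, not MariaDB code: the Worker class and the two helper
functions are hypothetical names invented for illustration, and real worker
threads are replaced by a plain queue.

```python
from collections import deque

class Worker:
    """Toy worker thread: a FIFO queue of transactions waiting to commit."""
    def __init__(self, queued):
        self.queue = deque(queued)

    def commit_next(self):
        # Commit the transaction at the head of the queue and return its name.
        return self.queue.popleft()

def woken_per_worker(worker, waiting_for):
    """Buggy rule: the waiter is tied to the worker, so the worker's first
    commit wakes it, whichever transaction that is (waiting_for is ignored)."""
    return worker.commit_next()

def woken_per_transaction(worker, waiting_for):
    """Fixed rule: the waiter is woken only once the awaited transaction
    itself has committed."""
    while True:
        txn = worker.commit_next()
        if txn == waiting_for:
            return txn

# T3 must commit after T2; worker 1 has both T1 and T2 queued.
print(woken_per_worker(Worker(["T1", "T2"]), "T2"))       # T1 (too early)
print(woken_per_transaction(Worker(["T1", "T2"]), "T2"))  # T2
```

The point of the sketch is only that the wait has to be registered against a
specific transaction, not against the thread that happens to run it.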

I will try to implement the new idea today.

> After a worker has finished executing a transaction, it should "register the
> transaction and wait" if transactions from other workers must be committed
> ahead of it. After a commit in one worker, wake up another worker: the one
> waiting to become the next "head of committee" should be woken up.

Right, I'll need to look into this a bit deeper. In my patch the wait and
wakeup happen inside ha_commit_trans(), and there is a reason for this:
eventually I want to do it inside tc_log->log_and_order(), which is called
from ha_commit_trans().

Here is how a commit happens:

  InnoDB prepare step
    fsync() InnoDB redo log                  (*A)
  TC_LOG_BINLOG::log_and_order
    Write transaction to binlog
    fsync() binlog                           (*B)
    InnoDB commit_ordered()                  (*C)
      Write commit record to InnoDB redo log
  InnoDB commit step

The steps (*A) and (*B) are slow, typically around 1-10 milliseconds depending
on the disk system. So we need many threads to commit in parallel and reach
points (*A) and (*B) at the same time; then a single fsync() serves many
threads. This is group commit.
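
As a rough illustration of why this matters, the following Python sketch
counts fsync() calls with and without grouping. It is a simplified model under
stated assumptions, not the server's algorithm: count_fsyncs, the arrival
times, and the fixed fsync duration are all made up for the example.

```python
def count_fsyncs(arrival_ms, fsync_ms, group=True):
    """Count the fsync() calls needed to make every transaction durable.

    arrival_ms: times at which transactions reach the sync point.
    fsync_ms:   assumed fixed duration of one fsync().
    group:      if True, all transactions already waiting share one fsync().
    """
    pending = sorted(arrival_ms)
    now = 0
    fsyncs = 0
    i = 0
    while i < len(pending):
        now = max(now, pending[i])      # idle until the next arrival
        if group:
            # Everything that has arrived by now joins this fsync().
            while i < len(pending) and pending[i] <= now:
                i += 1
        else:
            i += 1                      # one transaction per fsync()
        now += fsync_ms
        fsyncs += 1
    return fsyncs

arrivals = list(range(8))               # one transaction per millisecond
print(count_fsyncs(arrivals, fsync_ms=5, group=True))   # 3
print(count_fsyncs(arrivals, fsync_ms=5, group=False))  # 8
```

In this model, transactions that arrive while one fsync() is in progress pile
up and are all covered by the next one, which is where the saving comes from.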

Thus for in-order parallel replication, we must not wait for the previous
commit before the (*B) step. If we did, it would be impossible for two
transactions to be at point (*B) at the same time, and so group commit would
be impossible.

On the other hand, point (*C) is where the commit order is determined. So if
we do the wait after point (*C), then we cannot enforce that T1 commits before
T2.

Therefore, the wait must happen exactly around points (*B) and (*C), inside
TC_LOG_BINLOG::log_and_order(). That is why I invented
register_wait_for_prior_commit() and so on: so that log_and_order() has
somewhere to look up exactly who is waiting for whom. Then if T2 is waiting
for T1 to commit, we can do steps (*B) and (*C) for both of them together,
achieving both group commit and in-order parallel replication.
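
A small Python sketch of that idea follows. It is a toy model under my own
assumptions, not the actual patch: model_log_and_order and the wait_for
dictionary are hypothetical stand-ins for what log_and_order() could look up
after the waits have been registered.

```python
def model_log_and_order(queued, wait_for, already_committed=()):
    """Toy model of one group commit round.

    queued:   transactions that have reached the binlog step, in any order.
    wait_for: maps a transaction to the one that must commit before it.
    Returns (commit_order, deferred, steps).
    """
    done = set(already_committed)
    remaining = list(queued)
    order = []
    progress = True
    while progress:
        progress = False
        for txn in list(remaining):
            prior = wait_for.get(txn)
            # A transaction may join the group if it waits for nobody, or if
            # its predecessor is committed or already earlier in this group.
            if prior is None or prior in done:
                order.append(txn)
                done.add(txn)
                remaining.remove(txn)
                progress = True
    steps = [f"write {t} to binlog" for t in order]
    if order:
        steps.append("fsync binlog")                    # (*B), once per group
    steps += [f"commit_ordered {t}" for t in order]     # (*C), in order
    return order, remaining, steps

order, deferred, steps = model_log_and_order(
    ["T2", "T3", "T1"], {"T2": "T1", "T3": "T2"})
print(order)     # ['T1', 'T2', 'T3']
print(deferred)  # []
```

If T2 had not yet reached the binlog step, T3 would be left in the deferred
list for a later round, while T1 would still commit in this one.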

Anyway, I just wanted to mention this; I know it will be difficult to
understand fully from just this description. This is something that I have
been planning for years, but I still need to show some real code that
actually works. If I manage that, hopefully things will be clearer.

(If not - then I need to think again ;-)

Thanks,

 - Kristian.
