maria-developers team mailing list archive
Message #05038
Re: Reply: in-order commit
丁奇 <dingqi.lxb@xxxxxxxxxx> writes:
> Hi, Kristian
> Ok. I have got the information from JIRA.
>
> I find you control the commit order inside the user thread.
>
> Will it be easier to let Trans_worker thread hold this logic?
Yes, I think you are right. Of course, the user thread is the one that knows
the ordering, but the logic for waiting needs to be in the Trans_worker
thread. In fact this is a bug in my first patch: Transaction T3 could wait for
the THD of worker thread 1 which has both T1 and T2 queued; then it will wake
up too early, when T1 commits rather than when T2 does.
I will try to implement the new idea today.
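
The early wakeup described above can be illustrated with a small model. This is only a sketch, not code from the patch: the event tuples and the first_wakeup() helper are hypothetical names, and commits are modelled as a plain list rather than real threads.

```python
def first_wakeup(commits, matches):
    """Return the commit event that first wakes a waiter.

    commits: list of (thread, txn) tuples in the order they commit.
    matches: predicate deciding whether a commit event wakes the waiter.
    """
    for event in commits:
        if matches(event):
            return event
    return None


# Worker thread 1 has T1 and T2 queued and commits them in that order.
commits = [("worker1", "T1"), ("worker1", "T2")]

# Buggy rule: T3 waits on the THD of worker thread 1, so any commit by
# that thread wakes it.
woken_by_thread = first_wakeup(commits, lambda e: e[0] == "worker1")

# Intended rule: T3 waits for the specific transaction before it, T2.
woken_by_txn = first_wakeup(commits, lambda e: e[1] == "T2")

print(woken_by_thread)  # ('worker1', 'T1'): too early
print(woken_by_txn)     # ('worker1', 'T2')
```

Keying the wait on the transaction rather than on the thread is what removes the early wakeup in this model.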
> After a worker has finished executing a transaction, it should "register the transaction and wait" if transactions from other workers must be committed ahead of it.
> After a commit in one worker, wake up another worker: the worker that is waiting to become the next "head of committee" should be woken up.
Right, I'll need to look into this a bit deeper. In my patch, the actual wait
and wakeup happen inside ha_commit_trans(), and there is a reason for this:
eventually I want to do it inside tc_log->log_and_order(), which is called
from ha_commit_trans().
Here is how a commit happens:
    InnoDB prepare step
        fsync() InnoDB redo log                       (*A)
    TC_LOG_BINLOG::log_and_order
        Write transaction to binlog
        fsync() binlog                                (*B)
        InnoDB commit_ordered()                       (*C)
            Write commit record to InnoDB redo log
    InnoDB commit step
The steps (*A) and (*B) are slow, typically around 1-10 milliseconds depending
on the disk system. So we need many threads to commit in parallel and reach
points (*A) and (*B) at the same time, so that the fsync() is done only once
for many threads. This is group commit.
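
To make the benefit concrete, here is a rough back-of-the-envelope model. It is a sketch under simplifying assumptions: only the fsync() cost at (*A) and (*B) is counted, every group is assumed to be full, and the 5 ms figure is just an example value picked from the 1-10 ms range above. The function names are made up for illustration.

```python
import math


def fsyncs_per_sync_point(num_commits, group_size):
    """Number of fsync() calls at one sync point, one per commit group."""
    return math.ceil(num_commits / group_size)


def total_fsync_ms(num_commits, group_size, fsync_ms):
    """Time spent in fsync() at points (*A) and (*B) combined."""
    return 2 * fsyncs_per_sync_point(num_commits, group_size) * fsync_ms


# 100 commits, 5 ms per fsync() (assumed example value).
print(total_fsync_ms(100, 1, 5))   # 1000 ms without group commit
print(total_fsync_ms(100, 10, 5))  # 100 ms with groups of 10
```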
Thus for in-order parallel replication, we must not wait for the previous
commit before the (*B) step. If we did, it would be impossible for two
transactions to be at point (*B) at the same time, and so group commit would
be impossible.
On the other hand, point (*C) is where the commit order is determined. So if
we do the wait after point (*C), then we cannot enforce that T1 commits before
T2.
Therefore, the wait must happen right around points (*B) and (*C), inside
TC_LOG_BINLOG::log_and_order(). That is why I invented
register_wait_for_prior_commit() and so on: so that log_and_order() has
somewhere to look up exactly who is waiting for whom. Then if T2 is waiting
for T1 to commit, we can do steps (*B) and (*C) for both of them together,
achieving both group commit and in-order parallel replication.
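
The idea can be sketched as a single-threaded model. This is not the real implementation: the names register_wait_for_prior_commit and log_and_order are borrowed from the description above, but the Txn class, the order_by_wait() helper and all signatures here are hypothetical. The sketch also assumes that a prior transaction outside the current group has already committed.

```python
class Txn:
    """Hypothetical stand-in for a transaction inside a worker thread."""

    def __init__(self, name):
        self.name = name
        self.wait_for = None  # prior transaction that must commit first

    def register_wait_for_prior_commit(self, prior):
        self.wait_for = prior


def order_by_wait(group):
    """Order a group so each transaction follows the one it waits for."""
    members = {t.name for t in group}
    ordered, placed = [], set()

    def place(txn):
        if txn.name in placed:
            return
        prior = txn.wait_for
        # Assumption: a prior outside the group has already committed.
        if prior is not None and prior.name in members:
            place(prior)
        placed.add(txn.name)
        ordered.append(txn)

    for txn in group:
        place(txn)
    return ordered


def log_and_order(group):
    """Model of steps (*B) and (*C) for a group that arrived together."""
    events = []
    ordered = order_by_wait(group)
    for txn in ordered:
        events.append(("write binlog", txn.name))
    events.append(("fsync binlog", None))  # (*B), once for the whole group
    for txn in ordered:
        events.append(("commit_ordered", txn.name))  # (*C), in commit order
    return events


t1, t2, t3 = Txn("T1"), Txn("T2"), Txn("T3")
t2.register_wait_for_prior_commit(t1)
t3.register_wait_for_prior_commit(t2)

# The transactions reach the binlog step in arbitrary order.
events = log_and_order([t3, t1, t2])
commit_order = [name for kind, name in events if kind == "commit_ordered"]
fsync_count = sum(1 for kind, _ in events if kind == "fsync binlog")
print(commit_order)  # ['T1', 'T2', 'T3']
print(fsync_count)   # 1
```

In this model the three transactions share one fsync() of the binlog and still reach commit_ordered() in the registered order.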
Anyway, I just wanted to mention this. I know it will be difficult to
understand fully from just this description. This is something that I have
been planning for years, but I still need to show some real code that
actually works. If I manage that, hopefully things will be clearer.
(If not - then I need to think again ;-)
Thanks,
- Kristian.