
maria-developers team mailing list archive

Re: Architecture review of MWL#132 Transaction coordinator plugin

 

Sergei Golubchik <serg@xxxxxxxxxxxx> writes:

> Now, WL#132 - Transaction coordinator plugin

> Wouldn't it be simpler to create only group_log_xid() interface, no
> log_and_order() or log_xid() ? The tc plugin gets the list in
> group_log_xid() - it can reorder the list any way it wants, call
> prepare_ordered() and commit_ordered() as needed and so on.
> In this interpretation, group_log_xid() can meet all the use cases.
> And there's no need to create a multitude of methods that one
> needs to get familiar with before implementing a TC plugin.

I do not see how this would work. The group_log_xid() interface as specified
here does not allow the TC to reorder transactions; on the contrary, the commit
order has already been decided by the ordering of transactions in the passed
list.

But there is no need for multiple interfaces, just one: the log_and_order()
interface. That is my main idea with MWL#132: to generalise the TC interface
so that something like Galera is able to change commit order as it needs.

So there is only one plugin API, log_and_order(). The other interfaces
(log_xid() and group_log_xid()) are not plugin APIs, they are just helper
classes that one can use to implement some simpler types of TC plugins. I
thought they could be useful to provide somehow, but maybe it just confuses
the issue. Instead, they could just be examples, or maybe only something we
use internally in mysqld to implement TC_LOG_MMAP and TC_LOG_BINLOG.

(And as you suggest, maybe we do not need log_xid() at all, we could just
rewrite TC_LOG_MMAP to use group_log_xid()).

Does that make my intentions clearer?

----

So, to elaborate on the log_and_order() interface:

I think it is a nice generalisation. It is easy to implement group_log_xid()
in the log_and_order() framework; it is essentially the algorithm from
MWL#116. But log_and_order() is more general, since it allows the TC to change
the commit order. This is not possible in group_log_xid(), since it is called
only when the commit order has already been decided.

This is how I understand Galera works:

Galera first runs transactions in complete isolation on each node, buffering
row events just like the binlog.

Only during commit is the transaction replicated to other nodes. A global
transaction ID is assigned to the transaction; this ID is a monotonic sequence
which thus specifies the commit ordering relative to all other transactions in
the cluster. The events for the transaction are then shipped to all other
nodes.

A separate thread (or threads) applies transaction events received from other
nodes in global transaction ID order (similar to the slave SQL thread). The
commit of a local transaction is delayed until all other transactions with
earlier global transaction ID have been applied.

Galera uses optimistic concurrency control, assuming transactions can run
independently, and aborting one if there turns out to be a conflict after all.
They use the certification-based replication method to handle such
conflicts. As I understand it, the idea is to have each node check for
conflicts between transactions individually, but using a deterministic
algorithm that ensures that all nodes will make the same decision about which
transaction to roll back and which to keep. (Galera keeps track of primary key
values of all modified rows for this purpose).

(I hope I got this right, we should ask the Galera people for more details).
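
To make the idea of a deterministic per-node conflict check concrete, here is
a minimal sketch. It is not Galera's actual algorithm or data structures: the
WriteSet and Certifier types, the last_seen_id field and the rule "reject if a
key was written by a transaction this one never saw" are all assumptions made
for illustration. The only point is that every node fed the same stream in
global transaction ID order reaches the same keep/roll back decisions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical write set; NOT Galera's real data structure.
struct WriteSet {
  uint64_t global_id;     // position in the cluster-wide commit order
  uint64_t last_seen_id;  // newest global ID already applied on the
                          // originating node when the transaction ran
  std::vector<std::string> keys;  // primary keys of the modified rows
};

// Hypothetical certifier. Each node owns one and feeds it every write
// set in global_id order, so all nodes reach identical decisions.
class Certifier {
public:
  // Returns true to keep the transaction, false to roll it back.
  bool certify(const WriteSet &ws) {
    // A transaction cannot have seen its own or a later global ID.
    assert(ws.global_id > ws.last_seen_id);
    for (const std::string &key : ws.keys) {
      auto it = last_writer_.find(key);
      // Conflict: the row was changed by a certified transaction that
      // this transaction did not see when it executed.
      if (it != last_writer_.end() && it->second > ws.last_seen_id)
        return false;
    }
    // No conflict: record this transaction as the latest writer.
    for (const std::string &key : ws.keys)
      last_writer_[key] = ws.global_id;
    return true;
  }

private:
  std::map<std::string, uint64_t> last_writer_;  // key -> last writer ID
};
```

In this sketch, two transactions that both modified the same key without
seeing each other are resolved in favour of the one with the lower global
transaction ID, on every node alike.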

So this is where log_and_order() comes in. Galera would install a TC plugin,
and would receive a call into log_and_order() when a transaction commits. It
can then replicate the transaction across the cluster and assign a global
transaction ID. It can then synchronise among threads to invoke
prepare_ordered() in correct global transaction ID order, and afterwards
commit_ordered() in the same order. Then when it returns from log_and_order(),
the commit order has been correctly decided (or it can roll back a conflicting
transaction by returning an error from log_and_order()).

So it seems to be a good fit with Galera (though it still has to be shown to
work in practice).
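
The "synchronise among threads" step could look roughly like the sketch
below. It is only an illustration: CommitSequencer, run_in_order() and
demo_commit_order() are hypothetical names, not part of MariaDB or Galera, and
the ordered step is a stand-in for calling prepare_ordered() and
commit_ordered() from inside log_and_order(). It also assumes global
transaction IDs are consecutive integers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: lets each committing thread wait for its turn.
class CommitSequencer {
public:
  explicit CommitSequencer(uint64_t first_id) : next_id_(first_id) {}

  // Blocks until global_id is the next ID to commit, runs ordered_step
  // while holding the mutex, then lets the following ID proceed.
  template <typename Fn>
  void run_in_order(uint64_t global_id, Fn ordered_step) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this, global_id] { return next_id_ == global_id; });
    assert(next_id_ == global_id);
    ordered_step();  // stand-in for prepare_ordered() + commit_ordered()
    ++next_id_;
    cond_.notify_all();
  }

private:
  std::mutex mutex_;
  std::condition_variable cond_;
  uint64_t next_id_;
};

// Starts n threads in reverse ID order and returns the order in which
// their ordered steps actually ran.
std::vector<uint64_t> demo_commit_order(uint64_t n) {
  CommitSequencer sequencer(1);
  std::vector<uint64_t> order;  // only touched under the sequencer mutex
  std::vector<std::thread> workers;
  for (uint64_t id = n; id >= 1; --id) {
    workers.emplace_back([&sequencer, &order, id] {
      sequencer.run_in_order(id, [&order, id] { order.push_back(id); });
    });
  }
  for (std::thread &worker : workers)
    worker.join();
  return order;
}
```

Whatever order the threads arrive in, the ordered steps run in ascending
global transaction ID order. A real plugin would also need a way to skip the
ID of a transaction that is rolled back, which this sketch leaves out.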

Something like group_log_xid(list_of_transactions) does not really work here I
think. Galera may need to reorder a local transaction with another transaction
that has not even started yet when group_log_xid() is called, so even allowing
the TC to reorder the passed-in list seems insufficient.

Also the old log_xid() interface seems insufficient, as it provides no way for
Galera to control the order that transactions commit in after returning from
log_xid(). Hm, maybe it could wait for unlog() for transaction 1 before
returning from log_xid() for transaction 2, but that does not seem optimal (and
would prevent any kind of group commit).

> I still see no real value in keeping or supporting log_xid() interface.
>
> I think we can only implement one interface - group_log_xid() - and
> that's enough.

The central idea in group_log_xid() is the mechanism whereby transactions can
queue up while TC is busy making previous transactions durable. So when TC
becomes ready, we have a whole list of waiting transactions that can share the
next fsync().

This is really an implementation of group commit, not a fully general
interface. But it is general enough that it could probably be useful for other
binlog-like implementations also. Same for log_xid() more or less.

But I agree there is no need to have them as interfaces in the server. They
can just serve as examples of how things can be implemented.
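
As an illustration of the queueing mechanism described above, here is a
minimal sketch. GroupCommitQueue, enqueue() and flush_group() are hypothetical
names, not the MWL#116 code, and the counter merely stands in for a real write
plus fsync(). It assumes a single thread acts as the flusher at any one time.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical group commit queue, for illustration only.
struct GroupCommitQueue {
  std::mutex mutex;
  std::vector<unsigned long> waiting;  // XIDs queued while the TC is busy
  size_t sync_count = 0;               // number of simulated fsync() calls

  // Called by each committing transaction; cheap, just appends.
  void enqueue(unsigned long xid) {
    std::lock_guard<std::mutex> guard(mutex);
    waiting.push_back(xid);
  }

  // Called when the TC becomes ready: takes every waiting transaction
  // in queue order and makes the whole group durable with one sync.
  std::vector<unsigned long> flush_group() {
    std::vector<unsigned long> group;
    {
      std::lock_guard<std::mutex> guard(mutex);
      group.swap(waiting);
      assert(waiting.empty());
    }
    if (!group.empty())
      ++sync_count;  // stand-in for one write + fsync() for the group
    return group;
  }
};
```

Three transactions queued before flush_group() runs come back as one group in
queue order and cost a single simulated sync. The queue order is also the
commit order, which is why this scheme cannot reorder transactions.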

>> A TC based on this interface overrides group_log_xid() and
>> xid_log_after() instead of log_and_order(), and again does not need to
>> deal with any {prepare,commit}_ordered().
>
> Why do you need xid_log_after here ?

I think the original motivation was that group_log_xid() handles many
transactions in one thread, so it cannot call my_error() on each transaction
individually. After all, it is possible for some transactions to fail while
others succeed.

So xid_log_after() runs in each individual thread once group_log_xid() is
done, and can call my_error() for any deferred error.

But it seems in any case appropriate to have a part of TC logging that runs in
parallel, giving the TC the opportunity to reduce the amount of work done in
the critical code path under the global LOCK_group_commit mutex, just as the
serialised prepare_ordered() and commit_ordered() calls have parallel
counterparts prepare() and commit().
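
The split between the serialised group step and the per-thread step could be
sketched as follows. TrxEntry, group_log_sketch() and log_after_sketch() are
hypothetical names standing in for the roles of group_log_xid() and
xid_log_after(); the real signatures are not shown here, and returning a
string is only a placeholder for calling my_error() in the owning thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-transaction entry in the group commit queue.
struct TrxEntry {
  unsigned long xid;
  int deferred_error;  // 0 means the transaction was logged successfully
};

// Serialised step: one thread handles the whole group, so it can only
// record each outcome, not report it to the individual connections.
// write_ok stands in for writing one transaction's record to the log.
void group_log_sketch(std::vector<TrxEntry *> &group,
                      bool (*write_ok)(unsigned long xid)) {
  for (TrxEntry *entry : group) {
    assert(entry != nullptr);
    entry->deferred_error = write_ok(entry->xid) ? 0 : 1;
  }
}

// Parallel step: runs in each transaction's own thread afterwards and
// reports any deferred error (where the server would call my_error()).
std::string log_after_sketch(const TrxEntry &entry) {
  if (entry.deferred_error != 0)
    return "xid " + std::to_string(entry.xid) + ": logging failed";
  return "";
}
```

Some transactions in a group can fail while others succeed, and each failure
is reported only in the thread that owns the failed transaction.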

>>     If need_prepare_ordered or need_commit_ordered is passed as FALSE,
>>     then the corresponding call need not be done. It is safe to do it
>>     anyway, however omitting it avoids the need to take a global
>>     mutex.
>
> Why would this ever be needed ?
> (I mean need_prepare_ordered or need_commit_ordered being FALSE)

This is for engines that do not install prepare_ordered() and/or
commit_ordered() methods (or that disable them due to user configuration, in
case that provides better performance when consistent commit order is not
needed).

If these calls are not needed, then log_and_order() can take fewer locks,
avoiding LOCK_prepare_ordered and/or LOCK_commit_ordered.

Well, we already discussed changing LOCK_prepare_ordered to be the queue lock,
and removing LOCK_commit_ordered completely. That may leave nothing to be
saved, so I would just remove this.

(The only remaining case I can come up with is TC_LOG_MMAP; unless both
prepare_ordered() and commit_ordered() are installed, it need not do any
queueing at all, as there is no concept of commit order inside it. But this is
somewhat of a corner case).

>> In current MariaDB, we have two different TC implementations (as well
>> as a "dummy" empty implementation that I do not know if is used).
>
> The code in mysqld.cc is
>
>   tc_log= (total_ha_2pc > 1 ? (opt_bin_log  ?
>                                (TC_LOG *) &mysql_bin_log :
>                                (TC_LOG *) &tc_log_mmap) :
>            (TC_LOG *) &tc_log_dummy);
>
> so, tc_log_dummy is used when there's at most one xa-capable engine.
> But MySQL does not use 2pc for a transaction unless it has at least two
> xa-capable participants. In other words, tc_log_dummy is never used.

Ok, thanks for info.

 - Kristian.


