
maria-developers team mailing list archive

Re: [Commits] 7cabdc461b2: MDEV-6860 Parallel async replication hangs on a Galera node

 

Sachin Setiya <sachin.setiya@xxxxxxxxxxx> writes:

> On Mon, Jul 15, 2019 at 4:00 PM Kristian Nielsen
> <knielsen@xxxxxxxxxxxxxxx> wrote:

>> (I wonder if this isn't just another symptom of the underlying problem that
>> Galera has never been integrated properly into MariaDB and the group commit
>> algorithm / transaction master?).

> For example, let us consider the replication A -> B <==> C (A,B
> parallel replication optimistic, B,C Galera cluster nodes).
> Let's assume 2 inserts (T1 gtid x-x-1 and T2 x-x-2) from master A arrive
> at slave B.

> The 2nd insert prepares faster than the 1st insert, so it has already sent
> the writeset to node C. Now it is in the queue waiting for its turn to commit.

And this is the problem, IIUC. T2 has registered with the transaction
coordinator that it goes after T1. Galera is not allowed to put the T2
writeset ahead of T1 since T1 is required to commit before T2.

Basically, the transaction coordinator is the one that decides the commit
order. In non-Galera MariaDB, the transaction coordinator is in the binlog
group commit. But Galera needs to decide the commit order itself (that's the
core of its synchronous replication architecture). So Galera needs to take
over the role of the transaction coordinator, replacing the corresponding
logic in the binlog group commit.

This starts at TC_LOG_BINLOG::log_and_order(). There are already two
alternate transaction coordinators (the other is
TC_LOG_MMAP::log_and_order()). The whole system is designed so that
something like Galera would implement TC_LOG_GALERA::log_and_order() and
interface to the rest of MariaDB with functions like commit_ordered() and so
on. This is the right place to start fixing all these problems that Galera
has shown in MariaDB over the years, with root cause in disagreement over
who decides the commit order.
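
To make the idea concrete, here is a minimal single-threaded C++ sketch of a
coordinator that owns the commit order. It is a toy model, not MariaDB code:
ToyCoordinator, OrderedCoordinator and the seq_no argument are hypothetical
names used only for illustration, and the real log_and_order() has a different
signature. It only shows that when one component decides the order, a T2 that
arrives first is simply held back until T1 has been released.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy stand-in for a transaction coordinator (hypothetical, not TC_LOG).
class ToyCoordinator {
public:
  virtual ~ToyCoordinator() = default;
  // Called when a prepared transaction is ready to commit.
  // Returns the transactions that may commit now, in commit order.
  virtual std::vector<uint64_t> log_and_order(uint64_t seq_no) = 0;
};

// Hypothetical coordinator that releases transactions strictly in
// sequence-number order (1, 2, 3, ...), whatever order they arrive in.
class OrderedCoordinator : public ToyCoordinator {
  uint64_t next_ = 1;            // next sequence number allowed to commit
  std::set<uint64_t> waiting_;   // prepared, but not yet released
public:
  std::vector<uint64_t> log_and_order(uint64_t seq_no) override {
    assert(seq_no >= next_);     // assumption: each number is handed in once
    waiting_.insert(seq_no);
    std::vector<uint64_t> ready;
    // Release every waiting transaction whose predecessors are all released.
    while (!waiting_.empty() && *waiting_.begin() == next_) {
      ready.push_back(next_);
      waiting_.erase(waiting_.begin());
      ++next_;
    }
    return ready;
  }
};
```

Handing in T2 (seq_no 2) first returns an empty list; handing in T1 (seq_no 1)
afterwards returns both, with T1 ahead of T2.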

> The first insert does prepare on Galera
> (wsrep_run_wsrep_commit), but it is stuck because the T2 transaction still
> hasn't run post_commit on Galera.

If Galera were the transaction coordinator, it would know that T1 goes before
T2 in commit order, and it could prevent T1 from getting stuck waiting for
T2.

Hope this helps,

 - Kristian.

