maria-developers team mailing list archive

Thread
Date
Re: f1db957c5da: MDEV-21469: Implement crash-safe binary logging of the user XA

To: Sujatha <sujatha.sivakumar@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Sun, 14 Mar 2021 20:16:52 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <jqfWjHSeG7N.A.CeD.v-lTgB@pweza>
Hi, Sujatha!

Note, this is a review of a combined diff dc92235f21e + f1db957c5da

On Mar 14, Sujatha wrote:
> revision-id: f1db957c5da (mariadb-10.5.2-247-gf1db957c5da)
> parent(s): dc92235f21e
> author: Sujatha <sujatha.sivakumar@xxxxxxxxxxx>
> committer: Andrei Elkin <andrei.elkin@xxxxxxxxxxx>
> timestamp: 2020-12-21 16:10:46 +0200
> message:
> 
> MDEV-21469: Implement crash-safe binary logging of the user XA
> 
> Make XA PREPARE, XA COMMIT and XA ROLLBACK statements
> crash-safe when --log-bin is specified.
> 
> At execution of XA PREPARE, XA COMMIT and XA ROLLBACK their replication
> events are made written into the binary log prior to execution
> of the commands in storage engines.
> In case the server crashes, after writing to binary log but before all
> involved engines have processed the command, the following recovery
> will execute the command's replication events to equalize the states
> of involved engines with that of binlog.
> That applies to all XA PREPARE *group* and XA COMMIT or ROLLBACK.
> 
> On the implementation level the recovery time binary log parsing
> is augmented to pay attention to
> the user XA xids to identify the XA transactions' state:s in binary log, and
> eventually match them against their states in engines, see
> MYSQL_BIN_LOG::recover_explicit_xa_prepare().
> In discrepancy cases the outdated state in the engines is corrected with
> resubmitting the transaction prepare group of events, or completion
> ones. The multi-engine partly prepared XA PREPARE case
> the XA is rolled back first.
> The fact of multiple-engine involved is registered into
> Gtid_log_event::flags2 as one bit. The boolean value is
> sufficient and precise to deal with two engines in XA transaction.
> With more than 2 recoverable engines the flag method is still correct though
> may be pessimistic, as it treats all recoverable engines as XA
> participants. So when the number of such Engines exceeds the number
> of prepared engines of the XA that XA is treated
> as partially completed, with all that ensued.
> As an optimization no new bit is allocated in flags2, instead a
> pre-existing ones (of MDEV-742) are reused, observing that
> A. XA "COMPLETE" does not require multi-engine hint for its recovery and that
> B. the MDEV-742 XA-completion bit is not anyhow used by
>    XA-PREPARE and its GTID log event.
> 
> Notice the multi-engine flagging is superceded by MDEV-21117 extension
> in Gtid log event so this part should be taken from there.

> diff --git a/mysql-test/suite/binlog/t/binlog_xa_multi_binlog_crash_recovery.test b/mysql-test/suite/binlog/t/binlog_xa_multi_binlog_crash_recovery.test
> new file mode 100644
> index 00000000000..d0852de2fcc
> --- /dev/null
> +++ b/mysql-test/suite/binlog/t/binlog_xa_multi_binlog_crash_recovery.test
> @@ -0,0 +1,86 @@
> +# ==== Purpose ====
> +#
> +# Test verifies that XA crash recovery works fine across multiple binary logs.
> +#
> +# ==== Implementation ====
> +#
> +# Steps:
> +#    0 - Generate an explicit XA transaction. Using debug simulation hold the

"debug sync" != "debug sim" :)

> +#        execution of XA PREPARE statement after the XA PREPARE is written to
> +#        the binary log. With this the prepare will not be done in engine.
> +#    1 - By executing FLUSH LOGS generate multiple binary logs.
> +#    2 - Now make the server to disappear at this point.
> +#    3 - Restart the server. During recovery the XA PREPARE from the binary
> +#        log will be read. It is cross checked with engine. Since it is not
> +#        present in engine it will be executed once again.
> +#    4 - When server is up execute XA RECOVER to check that the XA is
> +#        prepared in engine as well.
> +#    5 - XA COMMIT the transaction and check the validity of the data.
> +#
> +# ==== References ====
> +#
> +# MDEV-21469: Implement crash-safe logging of the user XA
> +#
> +
> +--source include/have_innodb.inc
> +--source include/have_debug.inc
> +--source include/have_debug_sync.inc
> +--source include/have_log_bin.inc
> +
> +RESET MASTER;
> +
> +CREATE TABLE t1 (a INT PRIMARY KEY, b MEDIUMTEXT) ENGINE=Innodb;
> +CALL mtr.add_suppression("Found 1 prepared XA transactions");
> +
> +connect(con1,localhost,root,,);
> +XA START 'xa1';
> +INSERT INTO t1 SET a=1;
> +SET DEBUG_SYNC= "simulate_hang_after_binlog_prepare SIGNAL con1_ready WAIT_FOR con1_go";
> +SET GLOBAL DEBUG_DBUG="d,simulate_crash_after_binlog_prepare";
> +XA END 'xa1';
> +--send XA PREPARE 'xa1';
> +
> +connection default;
> +SET DEBUG_SYNC= "now WAIT_FOR con1_ready";
> +FLUSH LOGS;
> +FLUSH LOGS;
> +FLUSH LOGS;
> +
> +--source include/show_binary_logs.inc
> +--let $binlog_file= master-bin.000004
> +--let $binlog_start= 4
> +--source include/show_binlog_events.inc
> +
> +--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +wait

I don't see why you'd need to wait first, it'd be simpler to restart right away.

> +EOF
> +
> +--error 0,2013
> +SET DEBUG_SYNC= "now SIGNAL con1_go";
> +--source include/wait_until_disconnected.inc
> +
> +--connection con1
> +--error 2013
> +--reap
> +--source include/wait_until_disconnected.inc
> +
> +#
> +# Server restart
> +#
> +--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +restart
> +EOF
> +
> +connection default;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +XA RECOVER;
> +XA COMMIT 'xa1';
> +
> +SELECT * FROM t1;
> +
> +# Clean up.
> +connection default;
> +DROP TABLE t1;
> +
> diff --git a/mysql-test/suite/rpl/t/rpl_xa_commit_crash_safe.test b/mysql-test/suite/rpl/t/rpl_xa_commit_crash_safe.test
> new file mode 100644
> index 00000000000..e972e3f09de
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/rpl_xa_commit_crash_safe.test
> @@ -0,0 +1,98 @@
> +# ==== Purpose ====
> +#
> +# Test verifies that XA COMMIT statements are crash safe.
> +#
> +# ==== Implementation ====
> +#
> +# Steps:
> +#    0 - Generate 2 explicit XA transactions. 'xa1' and 'xa2'.
> +#        'xa1' will be prepared and committed.
> +#    1 - For 'xa2' let the XA COMMIT be done in binary log and crash the
> +#        server so that it is not committed in engine.
> +#    2 - Restart the server. The recovery code should successfully recover
> +#        'xa2'. The COMMIT should be executed during recovery.
> +#    3 - Check the data in table. Both rows should be present in table.
> +#    4 - Trying to commit 'xa2' should report unknown 'XA' error as COMMIT is
> +#        already complete during recovery.
> +#
> +# ==== References ====
> +#
> +# MDEV-21469: Implement crash-safe logging of the user XA
> +
> +
> +--source include/have_innodb.inc
> +--source include/master-slave.inc

master-slave must be always included last

> +--source include/have_debug.inc
> +
> +connect (master2,localhost,root,,);

why do you need master2 connection?

> +--connection master
> +CALL mtr.add_suppression("Found 1 prepared XA transactions");
> +
> +CREATE TABLE t ( f INT ) ENGINE=INNODB;
> +XA START 'xa1';
> +INSERT INTO t VALUES (20);
> +XA END 'xa1';
> +XA PREPARE 'xa1';
> +XA COMMIT 'xa1';
> +--sync_slave_with_master
> +--source include/stop_slave.inc
> +
> +--connection master1
> +XA START 'xa2';
> +INSERT INTO t VALUES (40);
> +XA END 'xa2';
> +XA PREPARE 'xa2';
> +
> +--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +wait
> +EOF

Same. Why do you wait?

> +
> +SET GLOBAL DEBUG_DBUG="d,simulate_crash_after_binlog_commit_or_rollback";
> +--error 2013 # CR_SERVER_LOST
> +XA COMMIT 'xa2';
> +--source include/wait_until_disconnected.inc
> +
> +--connection master1
> +--source include/wait_until_disconnected.inc
> +
> +--connection master
> +--source include/wait_until_disconnected.inc
> +
> +#
> +# Server restart
> +#
> +--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +restart
> +EOF
> +
> +connection default;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +# rpl_end.inc needs to use the connection server_1
> +connection server_1;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection master
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection slave
> +--source include/start_slave.inc
> +--sync_with_master
> +
> +--connection master
> +SELECT * FROM t;
> +XA RECOVER;
> +--error 1397 # ER_XAER_NOTA
> +XA COMMIT 'xa2';
> +SELECT * FROM t;
> +--sync_slave_with_master
> +
> +SELECT * FROM t;
> +
> +--connection master
> +DROP TABLE t;
> +--sync_slave_with_master
> +--source include/rpl_end.inc
> diff --git a/mysql-test/suite/rpl/t/rpl_xa_event_apply_failure.test b/mysql-test/suite/rpl/t/rpl_xa_event_apply_failure.test
> new file mode 100644
> index 00000000000..14cebbd9b13
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/rpl_xa_event_apply_failure.test
> @@ -0,0 +1,119 @@
> +# ==== Purpose ====
> +#
> +# Test verifies that if for some reason an event cannot be applied during
> +# recovery, appropriate error is reported.
> +#
> +# ==== Implementation ====
> +#
> +# Steps:
> +#    0 - Generate 2 explicit XA transactions. 'xa1' and 'xa2'.
> +#        'xa1' will be prepared and committed.
> +#    1 - For 'xa2' let the XA COMMIT be done in binary log and crash the
> +#        server so that it is not committed in engine.
> +#    2 - Restart the server. Using debug simulation point make XA COMMIT 'xa2'
> +#        execution to fail. The server will resume anyway
> +#        to leave the error in the errlog (see "Recovery: Error..").
> +#    3 - Work around the simulated failure with Commit once again
> +#        from a connection that turns OFF binlogging.
> +#        Slave must catch up with the master.
> +#
> +# ==== References ====
> +#
> +# MDEV-21469: Implement crash-safe logging of the user XA
> +
> +
> +--source include/have_innodb.inc
> +--source include/master-slave.inc

master-slave must be always included last

> +--source include/have_debug.inc
> +
> +connect (master2,localhost,root,,);

why do you need master2 connection?

> +--connection master
> +CALL mtr.add_suppression("Found 1 prepared XA transactions");
> +CALL mtr.add_suppression("Failed to execute binlog query event");
> +CALL mtr.add_suppression("Recovery: Error .Out of memory..");
> +CALL mtr.add_suppression("Crash recovery failed.");
> +CALL mtr.add_suppression("Can.t init tc log");
> +CALL mtr.add_suppression("Aborting");
> +
> +CREATE TABLE t ( f INT ) ENGINE=INNODB;
> +XA START 'xa1';
> +INSERT INTO t VALUES (20);
> +XA END 'xa1';
> +XA PREPARE 'xa1';
> +XA COMMIT 'xa1';
> +--sync_slave_with_master
> +--source include/stop_slave.inc
> +
> +--connection master1
> +XA START 'xa2';
> +INSERT INTO t VALUES (40);
> +XA END 'xa2';
> +XA PREPARE 'xa2';
> +
> +--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +wait

ditto

> +EOF
> +
> +SET GLOBAL DEBUG_DBUG="d,simulate_crash_after_binlog_commit_or_rollback";
> +--error 2013 # CR_SERVER_LOST
> +XA COMMIT 'xa2';
> +--source include/wait_until_disconnected.inc
> +
> +--connection master1
> +--source include/wait_until_disconnected.inc
> +
> +--connection master
> +--source include/wait_until_disconnected.inc
> +
> +#
> +# Server restart
> +#
> +--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +restart: --debug-dbug=d,trans_xa_commit_fail
> +EOF
> +
> +connection default;
> +--source include/wait_until_disconnected.inc
> +
> +connection default;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection master
> +--enable_reconnect
> +--echo *** must be no 'xa2' commit seen, as it's still prepared:
> +SELECT * FROM t;
> +XA RECOVER;
> +
> +# Commit it manually now to work around the extra binlog record
> +# by turning binlogging OFF by the connection.
> +
> +SET GLOBAL DEBUG_DBUG="";
> +SET SQL_LOG_BIN=0;
> +--error 0

why error 0? What will happen if you don't disable binlog?

> +XA COMMIT 'xa2';
> +SET SQL_LOG_BIN=1;
> +
> +
> +# rpl_end.inc needs to use the connection server_1
> +connection server_1;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection master
> +--source include/wait_until_connected_again.inc
> +
> +--connection slave
> +--source include/start_slave.inc
> +--sync_with_master
> +
> +--connection master
> +SELECT * FROM t;
> +
> +--sync_slave_with_master
> +SELECT * FROM t;
> +
> +--connection master
> +DROP TABLE t;
> +--sync_slave_with_master
> +--source include/rpl_end.inc
> diff --git a/mysql-test/suite/rpl/t/rpl_xa_prepare_commit_prepare.test b/mysql-test/suite/rpl/t/rpl_xa_prepare_commit_prepare.test
> new file mode 100644
> index 00000000000..7b987c7f29b
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/rpl_xa_prepare_commit_prepare.test
> @@ -0,0 +1,95 @@
> +# ==== Purpose ====
> +#
> +# Test verifies that XA PREPARE transactions are crash safe.
> +#
> +# ==== Implementation ====
> +#
> +# Steps:
> +#    0 - Generate 2 explicit XA transactions. 'xa1' and 'xa2'.
> +#        'xa1' will be prepared and committed.
> +#    1 - For 'xa2' let the XA PREPARE be done in binary log and crash the
> +#        server so that it is not prepared in engine.
> +#    2 - Restart the server. The recovery code should successfully recover
> +#        'xa2'.
> +#    3 - When server is up, execute XA RECOVER and verify that 'xa2' is
> +#        present.
> +#    4 - Commit the XA transaction and verify its correctness.
> +#
> +# ==== References ====
> +#
> +# MDEV-21469: Implement crash-safe logging of the user XA
> +
> +--source include/have_innodb.inc
> +--source include/master-slave.inc

ditto

> +--source include/have_debug.inc
> +
> +connect (master2,localhost,root,,);

ditto

> +--connection master
> +CALL mtr.add_suppression("Found 1 prepared XA transactions");
> +
> +CREATE TABLE t ( f INT ) ENGINE=INNODB;
> +XA START 'xa1';
> +INSERT INTO t VALUES (20);
> +XA END 'xa1';
> +XA PREPARE 'xa1';
> +XA COMMIT 'xa1';
> +--sync_slave_with_master
> +--source include/stop_slave.inc
> +
> +--connection master1
> +XA START 'xa2';
> +INSERT INTO t VALUES (40);
> +XA END 'xa2';
> +
> +--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +wait

ditto

> +EOF
> +
> +SET GLOBAL DEBUG_DBUG="d,simulate_crash_after_binlog_prepare";
> +--error 2013 # CR_SERVER_LOST
> +XA PREPARE 'xa2';
> +--source include/wait_until_disconnected.inc
> +
> +--connection master1
> +--source include/wait_until_disconnected.inc
> +
> +--connection master
> +--source include/wait_until_disconnected.inc
> +
> +#
> +# Server restart
> +#
> +--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +restart
> +EOF
> +
> +connection default;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +# rpl_end.inc needs to use the connection server_1
> +connection server_1;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection master
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection slave
> +--source include/start_slave.inc
> +--sync_with_master
> +
> +--connection master
> +SELECT * FROM t;
> +XA RECOVER;
> +XA COMMIT 'xa2';
> +SELECT * FROM t;
> +--sync_slave_with_master
> +
> +SELECT * FROM t;
> +
> +--connection master
> +DROP TABLE t;
> +--sync_slave_with_master
> +--source include/rpl_end.inc
> diff --git a/mysql-test/suite/rpl/t/rpl_xa_prepare_crash_safe.test b/mysql-test/suite/rpl/t/rpl_xa_prepare_crash_safe.test
> new file mode 100644
> index 00000000000..9d2c5cce528
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/rpl_xa_prepare_crash_safe.test
> @@ -0,0 +1,117 @@
> +# ==== Purpose ====
> +#
> +# Test verifies that XA PREPARE transactions are crash safe.
> +#
> +# ==== Implementation ====
> +#
> +# Steps:
> +#    0 - Generate 3 explicit XA transactions. 'xa1', 'xa2' and 'xa3'.
> +#        Using debug simulation hold the execution of second XA PREPARE
> +#        statement after the XA PREPARE is written to the binary log.
> +#        With this the prepare will not be done in engine.
> +#    1 - For 'xa3' allow the PREPARE statement to be written to binary log and
> +#        simulate server crash.
> +#    2 - Restart the server. The recovery code should successfully recover
> +#        'xa2' and 'xa3'.
> +#    3 - When server is up, execute XA RECOVER and verify that 'xa2' and 'xa3'
> +#        are present along with 'xa1'.
> +#    4 - Commit all the XA transactions and verify their correctness.
> +#
> +# ==== References ====
> +#
> +# MDEV-21469: Implement crash-safe logging of the user XA
> +
> +
> +--source include/have_innodb.inc
> +--source include/master-slave.inc

ditto

> +--source include/have_debug.inc
> +
> +connect (master2,localhost,root,,);

ok, in this test you actually use master2 :)

> +--connection master
> +CALL mtr.add_suppression("Found 1 prepared XA transactions");
> +CALL mtr.add_suppression("Found 2 prepared XA transactions");
> +CALL mtr.add_suppression("Found 3 prepared XA transactions");
> +
> +CREATE TABLE t ( f INT ) ENGINE=INNODB;
> +XA START 'xa1';
> +INSERT INTO t VALUES (20);
> +XA END 'xa1';
> +XA PREPARE 'xa1';
> +--sync_slave_with_master
> +--source include/stop_slave.inc
> +
> +--connection master1
> +use test;
> +xa start 'xa2';
> +insert into t values (30);
> +xa end 'xa2';
> +SET DEBUG_SYNC="simulate_hang_after_binlog_prepare SIGNAL reached WAIT_FOR go";
> +send xa prepare 'xa2';
> +
> +--connection master2
> +let $wait_condition=
> +  SELECT COUNT(*) = 1 FROM INFORMATION_SCHEMA.PROCESSLIST
> +    WHERE STATE like "debug sync point: simulate_hang_after_binlog_prepare%";
> +--source include/wait_condition.inc

eh? you can just 'WAIT_FOR signal', couldn't you?

> +
> +XA START 'xa3';
> +INSERT INTO t VALUES (40);
> +XA END 'xa3';
> +
> +--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +wait

ditto

> +EOF
> +
> +SET GLOBAL DEBUG_DBUG="d,simulate_crash_after_binlog_prepare";
> +--error 2013 # CR_SERVER_LOST
> +XA PREPARE 'xa3';
> +--source include/wait_until_disconnected.inc
> +
> +--connection master1
> +--error 2013
> +--reap
> +--source include/wait_until_disconnected.inc
> +
> +--connection master
> +--source include/wait_until_disconnected.inc
> +
> +#
> +# Server restart
> +#
> +--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +restart
> +EOF
> +
> +connection default;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +# rpl_end.inc needs to use the connection server_1
> +connection server_1;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection master
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +
> +--connection slave
> +--source include/start_slave.inc
> +--sync_with_master
> +
> +--connection master
> +SELECT * FROM t;
> +XA RECOVER;
> +XA COMMIT 'xa1';
> +XA COMMIT 'xa2';
> +XA COMMIT 'xa3';
> +SELECT * FROM t;
> +--sync_slave_with_master
> +
> +SELECT * FROM t;
> +
> +--connection master
> +DROP TABLE t;
> +--sync_slave_with_master
> +--source include/rpl_end.inc
> diff --git a/mysql-test/suite/rpl/t/rpl_xa_rollback_commit_crash_safe.test b/mysql-test/suite/rpl/t/rpl_xa_rollback_commit_crash_safe.test
> new file mode 100644
> index 00000000000..1d19f96116e
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/rpl_xa_rollback_commit_crash_safe.test
> @@ -0,0 +1,97 @@
> +# ==== Purpose ====
> +#
> +# Test verifies that XA COMMIT statements are crash safe.

XA ROLLBACK

> +#
> +# ==== Implementation ====
> +#
> +# Steps:
> +#    0 - Generate 2 explicit XA transactions. 'xa1' and 'xa2'.
> +#        'xa1' will be prepared and committed.
> +#    1 - For 'xa2' let the XA ROLLBACK be done in binary log and crash the
> +#        server so that it is not committed in engine.

not rolled back

> +#    2 - Restart the server. The recovery code should successfully recover
> +#        'xa2'. The ROLLBACK should be executed during recovery.
> +#    3 - Check the data in table. Only one row should be present in table.
> +#    4 - Trying to rollback 'xa2' should report unknown 'XA' error as rollback
> +#        is already complete during recovery.
> +#
> +# ==== References ====
> +#
> +# MDEV-21469: Implement crash-safe logging of the user XA
> +
> +--source include/have_innodb.inc
> +--source include/master-slave.inc

ditto

> +--source include/have_debug.inc
> +
> +connect (master2,localhost,root,,);

unused again

> +--connection master
> +CALL mtr.add_suppression("Found 1 prepared XA transactions");
> +
> +CREATE TABLE t ( f INT ) ENGINE=INNODB;
> +XA START 'xa1';
> +INSERT INTO t VALUES (20);
> +XA END 'xa1';
> +XA PREPARE 'xa1';
> +XA COMMIT 'xa1';
> +--sync_slave_with_master
> +--source include/stop_slave.inc
> +
> +--connection master1
> +XA START 'xa2';
> +INSERT INTO t VALUES (40);
> +XA END 'xa2';
> +XA PREPARE 'xa2';
> +
> +--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +wait

ditto

> +EOF
> +
> +SET GLOBAL DEBUG_DBUG="d,simulate_crash_after_binlog_commit_or_rollback";
> +--error 2013 # CR_SERVER_LOST
> +XA ROLLBACK 'xa2';
> +--source include/wait_until_disconnected.inc
> +
> +--connection master1
> +--source include/wait_until_disconnected.inc
> +
> +--connection master
> +--source include/wait_until_disconnected.inc
> +
> +#
> +# Server restart
> +#
> +--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +restart
> +EOF
> +
> +connection default;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +# rpl_end.inc needs to use the connection server_1
> +connection server_1;
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection master
> +--enable_reconnect
> +--source include/wait_until_connected_again.inc
> +
> +--connection slave
> +--source include/start_slave.inc
> +--sync_with_master
> +
> +--connection master
> +SELECT * FROM t;
> +XA RECOVER;
> +--error 1397 # ER_XAER_NOTA
> +XA ROLLBACK 'xa2';
> +SELECT * FROM t;
> +--sync_slave_with_master
> +
> +SELECT * FROM t;
> +
> +--connection master
> +DROP TABLE t;
> +--sync_slave_with_master
> +--source include/rpl_end.inc
> diff --git a/storage/rocksdb/mysql-test/rocksdb_rpl/t/rpl_rocksdb_xa_recover.test b/storage/rocksdb/mysql-test/rocksdb_rpl/t/rpl_rocksdb_xa_recover.test
> new file mode 100644
> index 00000000000..c7cefbda4bf
> --- /dev/null
> +++ b/storage/rocksdb/mysql-test/rocksdb_rpl/t/rpl_rocksdb_xa_recover.test
> @@ -0,0 +1,46 @@
> +# MDEV-742, MDEV-21469 XA replication, and xa crash-safe.
> +# Tests prove xa state is recovered to a prepared or completed state
> +# upon post-crash recovery, incl a multi-engine case.
> +
> +--source include/have_rocksdb.inc
> +--source include/have_innodb.inc
> +--source include/have_debug.inc
> +--source include/master-slave.inc

ditto

> +--source include/have_binlog_format_row.inc
> +
> +--connection slave
> +--source include/stop_slave.inc
> +
> +--connection master
> +CALL mtr.add_suppression("Found . prepared XA transactions");
> +CALL mtr.add_suppression("Failed to execute binlog query event");
> +CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=rocksdb;
> +CREATE TABLE t2 (a INT PRIMARY KEY) ENGINE=innodb;
> +INSERT INTO t1 SET a = 1;
> +INSERT INTO t2 SET a = 1;
> +
> +--let $xa=xa1
> +--let $when=after_binlog
> +--source rpl_rocksdb_xa_recover.inc
> +
> +--let $xa=xa2
> +--let $when=after_first_engine
> +--let $finally_expected=$xa in the list
> +--source rpl_rocksdb_xa_recover.inc
> +
> +--connection slave
> +--source include/start_slave.inc
> +
> +--connection master
> +--sync_slave_with_master
> +let $diff_tables= master:t1, slave:t1;
> +source include/diff_tables.inc;
> +let $diff_tables= master:t2, slave:t2;
> +source include/diff_tables.inc;
> +
> +--connection master
> +SET SESSION DEBUG_DBUG="";
> +drop table t1,t2;
> +--sync_slave_with_master
> +
> +--source include/rpl_end.inc
> diff --git a/storage/rocksdb/mysql-test/rocksdb_rpl/t/rpl_rocksdb_xa_recover.inc b/storage/rocksdb/mysql-test/rocksdb_rpl/t/rpl_rocksdb_xa_recover.inc
> new file mode 100644
> index 00000000000..312b8b6a19e
> --- /dev/null
> +++ b/storage/rocksdb/mysql-test/rocksdb_rpl/t/rpl_rocksdb_xa_recover.inc
> @@ -0,0 +1,67 @@
> +# callee of rpl_rocksdb_xa_recover.test
> +# requires t1,t2 as defined in the caller.
> +
> +--let $at=_prepare
> +--let $finally_expected=$xa in the list
> +--let $init_val=`SELECT MAX(a) from t1 as t_1`

why as t_1 ?

> +--eval XA START '$xa'
> +--eval INSERT INTO t1 SET a=$init_val + 1
> +--eval INSERT INTO t2 SET a=$init_val + 1
> +--eval XA END '$xa'
> +--eval SET SESSION DEBUG_DBUG="d,simulate_crash_$when$at"
> +
> +--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +--error 2013 # CR_SERVER_LOST
> +--eval XA PREPARE '$xa'
> +--source include/wait_until_disconnected.inc
> +--let $rpl_server_number = 1
> +--source include/rpl_reconnect.inc
> +--echo "*** $finally_expected"
> +XA RECOVER;
> +--eval XA ROLLBACK '$xa'
> +--eval SELECT MAX(a) - $init_val  as zero FROM t1
> +--eval SELECT MAX(a) - $init_val  as zero FROM t2
> +
> +
> +--let $at=_commit_or_rollback
> +--let $finally_expected=empty list (rolled back)
> +--eval XA START '$xa'
> +--eval INSERT INTO t1 SET a=$init_val + 2
> +--eval INSERT INTO t2 SET a=$init_val + 2
> +--eval XA END '$xa'
> +--eval XA PREPARE '$xa'
> +
> +--eval SET SESSION DEBUG_DBUG="d,simulate_crash_$when$at"
> +--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +--error 2013 # CR_SERVER_LOST
> +--eval XA ROLLBACK '$xa'
> +--source include/wait_until_disconnected.inc
> +--let $rpl_server_number = 1
> +--source include/rpl_reconnect.inc
> +
> +--echo "*** $finally_expected"
> +XA RECOVER;
> +--eval SELECT MAX(a) - $init_val  as zero FROM t1
> +--eval SELECT MAX(a) - $init_val  as zero FROM t2
> +
> +
> +--let $at=_commit_or_rollback
> +--let $finally_expected=empty list (committed away)
> +--eval XA START '$xa'
> +--eval INSERT INTO t1 SET a=$init_val + 3
> +--eval INSERT INTO t2 SET a=$init_val + 3
> +--eval XA END '$xa'
> +--eval XA PREPARE '$xa'
> +
> +--eval SET SESSION DEBUG_DBUG="d,simulate_crash_$when$at"
> +--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> +--error 2013 # CR_SERVER_LOST
> +--eval XA COMMIT '$xa'
> +--source include/wait_until_disconnected.inc
> +--let $rpl_server_number = 1
> +--source include/rpl_reconnect.inc
> +
> +--echo "*** $finally_expected"
> +XA RECOVER;
> +--eval SELECT MAX(a) - $init_val as three FROM t1
> +--eval SELECT MAX(a) - $init_val as three FROM t2
> diff --git a/sql/log_event.h b/sql/log_event.h
> index 639cbfbe7aa..e1880339e85 100644
> --- a/sql/log_event.h
> +++ b/sql/log_event.h
> @@ -3614,6 +3616,12 @@ class Gtid_log_event: public Log_event
>    static const uchar FL_PREPARED_XA= 64;
>    /* FL_"COMMITTED or ROLLED-BACK"_XA is set for XA transaction. */
>    static const uchar FL_COMPLETED_XA= 128;
> +  /*
> +    To mark the fact of multiple transactional engine participants
> +    in the prepared XA. The FL_COMPLETED_XA bit is reused by XA_PREPARE_LOG_EVENT,

What do you mean by that? XA_prepare_log_event does not inherit from
Gtid_log_event, it cannot use Gtid_log_event flags.

And, please, explain in a comment why you can reuse a bit for two different
flags. Like "the ambiguity between FL_COMPLETED_XA and FL_MULTI_ENGINE_XA is
resolved by the flag FL_PREPARED_XA. if it's set, then 128 means
FL_MULTI_ENGINE_XA, if it's not set, then 128 means FL_COMPLETED_XA"

> +    oth the XA completion events do not need such marking.

other

> +  */
> +  static const uchar FL_MULTI_ENGINE_XA= 128;
>  
>  #ifdef MYSQL_SERVER
>    Gtid_log_event(THD *thd_arg, uint64 seq_no, uint32 domain_id, bool standalone,
> diff --git a/sql/handler.h b/sql/handler.h
> index 0eff7bd930d..9acdd47bfc2 100644
> --- a/sql/handler.h
> +++ b/sql/handler.h
> @@ -1880,8 +1891,18 @@ class Ha_trx_info
>      m_ht= ht_arg;
>      m_flags= (int) TRX_READ_ONLY; /* Assume read-only at start. */
>  
> -    m_next= trans->ha_list;
> -    trans->ha_list= this;
> +    if (likely(!past_head))
> +    {
> +      m_next= trans->ha_list;
> +      trans->ha_list= this;
> +    }
> +    else
> +    {
> +      DBUG_ASSERT(trans->ha_list);
> +
> +      m_next= trans->ha_list->m_next;
> +      trans->ha_list->m_next= this;
> +    }

I don't like this fake generalization. There's no point for an arbitrary
engine to be put at a specific place in the list. Only binlog is special.

At least, remove this new argument and check for binlog_hton here.

But really, think of this - binlog is so special, may be it'd be easier not
to pretend it's a storage engine at all? No need for a fake binlog_hton,
no need to register it, etc. It kind of made sense in 2005, but it doesn't
necessarily make it now.

>    }
>  
>    /** Clear, prepare for reuse. */
> diff --git a/sql/handler.cc b/sql/handler.cc
> index ffb89b30d92..b981294f0d6 100644
> --- a/sql/handler.cc
> +++ b/sql/handler.cc
> @@ -1338,6 +1345,9 @@ int ha_prepare(THD *thd)
>        handlerton *ht= ha_info->ht();
>        if (ht->prepare)
>        {
> +        DBUG_EXECUTE_IF("simulate_crash_after_first_engine_prepare",
> +                        if (!ha_info->next()) DBUG_SUICIDE(););

looks like not simulate_crash_after_first_engine_prepare, but
more like simulate_crash_before_last_engine_prepare.

it's the same point in time if you have only two engines,
but the code still implements "before last", not "after first"

> +
>          if (unlikely(prepare_or_error(ht, thd, all)))
>          {
>            ha_rollback_trans(thd, all);
> @@ -1368,22 +1378,22 @@ int ha_prepare(THD *thd)
>  }
>  
>  /*
> -  Like ha_check_and_coalesce_trx_read_only to return counted number of
> -  read-write transaction participants limited to two, but works in the 'all'
> -  context.
> -  Also returns the last found rw ha_info through the 2nd argument.
> +  Returns counted number of
> +  read-write recoverable transaction participants optionally limited to two.
> +  Also optionally returns the last found rw ha_info through the 2nd argument.
>  */
> -uint ha_count_rw_all(THD *thd, Ha_trx_info **ptr_ha_info)
> +uint ha_count_rw_all(THD *thd, Ha_trx_info **ptr_ha_info, bool count_through)

I don't understand why you need a new argument. ptr_ha_info==NULL means don't
return a value there, this is a standard convention used everywhere.

also, it's not an "count rw all" it's are there more rw-all htons than N
where N can be 1 or 2.

So:
 * either take N as an argument and return as soon as rw_ha_count > N
 * or don't overoptimize here and just return full count every time

>  {
>    unsigned rw_ha_count= 0;
>  
>    for (auto ha_info= thd->transaction.all.ha_list; ha_info;
>         ha_info= ha_info->next())
>    {
> -    if (ha_info->is_trx_read_write())
> +    if (ha_info->is_trx_read_write() && ha_info->ht()->recover)
>      {
> -      *ptr_ha_info= ha_info;
> -      if (++rw_ha_count > 1)
> +      if (ptr_ha_info)
> +        *ptr_ha_info= ha_info;
> +      if (++rw_ha_count > 1 && !count_through)
>          break;
>      }
>    }
> @@ -1876,6 +1886,10 @@ commit_one_phase_2(THD *thd, bool all, THD_TRANS *trans, bool is_real_trans)
>      {
>        int err;
>        handlerton *ht= ha_info->ht();
> +
> +      DBUG_EXECUTE_IF("simulate_crash_after_first_engine_commit_or_rollback",
> +                      if (!ha_info->next()) DBUG_SUICIDE(););

1. same as above
2. why not to use *different* names for these two crashes?
in case, you know, you'll want to crash at a specific point in the code, not
just somewhere? And in a case you'll actually want to crash just somewhere you
can always use "+d,simulate_crash_after_first_engine_commit,simulate_crash_after_first_engine_rollback"
(and it should be "before_last", of course, not "after_first")


> +
>        if ((err= ht->commit(ht, thd, all)))
>        {
>          my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
> @@ -1987,6 +2001,10 @@ int ha_rollback_trans(THD *thd, bool all)
>      {
>        int err;
>        handlerton *ht= ha_info->ht();
> +
> +      DBUG_EXECUTE_IF("simulate_crash_after_first_engine_commit_or_rollback",
> +                      if (!ha_info->next()) DBUG_SUICIDE(););

ditto

> +
>        if ((err= ht->rollback(ht, thd, all)))
>        { // cannot happen
>          my_error(ER_ERROR_DURING_ROLLBACK, MYF(0), err);
> @@ -2098,8 +2116,24 @@ int ha_commit_or_rollback_by_xid(XID *xid, bool commit)
>    xaop.xid= xid;
>    xaop.result= 1;
>  
> +  if (binlog_hton->recover)
> +  {
> +    /*
> +      When the binlogging service is enabled complete the transaction
> +      by it first.
> +    */
> +    if (commit)
> +      binlog_hton->commit_by_xid(binlog_hton, xid);
> +    else
> +      binlog_hton->rollback_by_xid(binlog_hton, xid);
> +  }
>    plugin_foreach(NULL, commit ? xacommit_handlerton : xarollback_handlerton,
>                   MYSQL_STORAGE_ENGINE_PLUGIN, &xaop);
> +  if (binlog_hton->recover)
> +  {
> +    THD *thd= current_thd;
> +    thd->reset_binlog_completed_by_xid();
> +  }

1. this, precisely, shows that binlog does not need a hton anymoreo
2. how can binlog_hton->recover be NULL here?

>  
>    return xaop.result;
>  }
> @@ -2222,6 +2258,7 @@ static my_bool xarecover_handlerton(THD *unused, plugin_ref plugin,
>  
>    if (hton->recover)
>    {
> +    info->recover_htons++;

don't forget to subtract 1 for binlog_hton->recover,
otherwise it'll make you to rollback everything with a FL_MULTI_ENGINE_XA flag

>      while ((got= hton->recover(hton, info->list, info->len)) > 0 )
>      {
>        sql_print_information("Found %d prepared transaction(s) in %s",
> @@ -2263,7 +2300,21 @@ static my_bool xarecover_handlerton(THD *unused, plugin_ref plugin,
>              _db_doprnt_("ignore xid %s", xid_to_str(buf, info->list[i]));
>              });
>            xid_cache_insert(info->list + i);
> +          XID *foreign_xid= info->list + i;

swap those two lines ^^^ then you can write

       xid_cache_insert(foreign_xid);

>            info->found_foreign_xids++;
> +
> +           /*
> +             For each foreign xid prepraed in engine, check if it is present in
> +             xa_prepared_list of binlog.
> +           */
> +          if (info->xa_prepared_list)

for simplicity I'd rather always initialize xa_prepared_list and remove
this if()

> +          {
> +            struct xa_recovery_member *member= NULL;
> +            if ((member= (xa_recovery_member *)
> +                 my_hash_search(info->xa_prepared_list, foreign_xid->key(),
> +                                foreign_xid->key_length())))
> +              member->in_engine_prepare++;
> +          }
>            continue;
>          }
>          if (IF_WSREP(!(wsrep_emulate_bin_log &&
> @@ -2362,7 +2424,7 @@ int ha_recover(HASH *commit_list)
>                      info.found_my_xids, opt_tc_log_file);
>      DBUG_RETURN(1);
>    }
> -  if (info.commit_list)
> +  if (info.commit_list && !info.found_foreign_xids)

Why? after the finished crash recovery one can still have transactions in
doubt.

>      sql_print_information("Crash recovery finished.");
>    DBUG_RETURN(0);
>  }
> diff --git a/sql/log.cc b/sql/log.cc
> index 731bb3e98f0..c69b8518cf4 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -2019,28 +2027,49 @@ static int binlog_xa_recover_dummy(handlerton *hton __attribute__((unused)),
>    return 0;
>  }
>  
> -
> +/*
> +  The function invokes binlog_commit() and returns its result
> +  when it has not yet called it already.
> +  binlog_cache_mngr::completed_by_xid remembers the fact of
> +  the 1st of maximum two subsequent calls.
> +*/
>  static int binlog_commit_by_xid(handlerton *hton, XID *xid)
>  {
> +  int rc= 0;
>    THD *thd= current_thd;
> -
> -  (void) thd->binlog_setup_trx_data();
> +  binlog_cache_mngr *cache_mngr= thd->binlog_setup_trx_data();
>  
>    DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_COMMIT);
>  
> -  return binlog_commit(hton, thd, TRUE);
> -}
> +  if (!cache_mngr->completed_by_xid)

On what code path can you have completed_by_xid == false here?

> +  {
> +    rc= binlog_commit(hton, thd, TRUE);
> +    cache_mngr->completed_by_xid= true;
> +  }
>  
> +  return rc;
> +}
>  
> +/*
> +  The function invokes binlog_rollback() and returns its results similarly
> +  to a commit branch.
> +*/
>  static int binlog_rollback_by_xid(handlerton *hton, XID *xid)
>  {
> +  int rc= 0;
>    THD *thd= current_thd;
> -
> -  (void) thd->binlog_setup_trx_data();
> +  binlog_cache_mngr *cache_mngr= thd->binlog_setup_trx_data();
>  
>    DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_ROLLBACK ||
>                (thd->transaction.xid_state.get_state_code() == XA_ROLLBACK_ONLY));
> -  return binlog_rollback(hton, thd, TRUE);
> +
> +  if (!cache_mngr->completed_by_xid)
> +  {
> +    rc= binlog_rollback(hton, thd, TRUE);
> +    cache_mngr->completed_by_xid= true;
> +  }
> +
> +  return rc;
>  }
>  
>  
> @@ -2195,6 +2224,12 @@ static int binlog_commit(handlerton *hton, THD *thd, bool all)
>    if (!all)
>      cache_mngr->trx_cache.set_prev_position(MY_OFF_T_UNDEF);
>  
> +  DBUG_EXECUTE_IF("simulate_crash_after_binlog_commit_or_rollback",
> +                  DBUG_SUICIDE(););
> +  DEBUG_SYNC(thd, "simulate_hang_after_binlog_prepare");
> +  DBUG_EXECUTE_IF("simulate_crash_after_binlog_prepare",
> +                  DBUG_SUICIDE(););

wouldn't you say there's one DBUG_EXECUTE_IF too many?
Just keep one, "simulate_crash_after_binlog_commit"

> +
>    THD_STAGE_INFO(thd, org_stage);
>    DBUG_RETURN(error);
>  }
> @@ -2298,6 +2333,9 @@ static int binlog_rollback(handlerton *hton, THD *thd, bool all)
>      cache_mngr->trx_cache.set_prev_position(MY_OFF_T_UNDEF);
>    thd->reset_binlog_for_next_statement();
>  
> +  DBUG_EXECUTE_IF("simulate_crash_after_binlog_commit_or_rollback",

same as above, use unique dbug keywords, like
"simulate_crash_after_binlog_rollback" (because it's rollback here)

> +                  DBUG_SUICIDE(););
> +
>    DBUG_RETURN(error);
>  }
>  
> @@ -5749,6 +5794,14 @@ binlog_cache_mngr *THD::binlog_setup_trx_data()
>    DBUG_RETURN(cache_mngr);
>  }
>  
> +/*
> +  XA completion flag resetter for ha_commit_or_rollback_by_xid().
> +*/
> +void THD::reset_binlog_completed_by_xid()
> +{
> +  binlog_setup_trx_data()->reset_completed_by_xid();
> +}

I don't think this makes any sense as a THD method

> +
>  /*
>    Function to start a statement and optionally a transaction for the
>    binary log.
> @@ -10333,14 +10386,108 @@ start_binlog_background_thread()
>    return 0;
>  }
>  
> +#ifdef HAVE_REPLICATION
> +/**
> +  Auxiliary function for TC_LOG::recover().
> +  @returns a successfully created and inserted @c xa_recovery_member
> +             into hash @c hash_arg,
> +           or NULL.
> +*/
> +static xa_recovery_member*
> +xa_member_insert(HASH *hash_arg, xid_t *xid_arg, xa_binlog_state state_arg,
> +              MEM_ROOT *ptr_mem_root)
> +{
> +  xa_recovery_member *member= (xa_recovery_member*)
> +    alloc_root(ptr_mem_root, sizeof(xa_recovery_member));
> +  if (!member)
> +    return NULL;
> +
> +  member->xid.set(xid_arg);
> +  member->state= state_arg;
> +  member->in_engine_prepare= 0;
> +  return my_hash_insert(hash_arg, (uchar*) member) ? NULL : member;
> +}
> +
> +/* Inserts or updates an existing hash member with a proper state */
> +static bool xa_member_replace(HASH *hash_arg, xid_t *xid_arg, bool is_prepare,
> +                              MEM_ROOT *ptr_mem_root)
> +{
> +  if(is_prepare)
> +  {
> +    if (!(xa_member_insert(hash_arg, xid_arg, XA_PREPARE, ptr_mem_root)))
> +      return true;
> +  }
> +  else
> +  {
> +    /*
> +      Search if XID is already present in recovery_list. If found
> +      and the state is 'XA_PREPRAED' mark it as XA_COMPLETE.
> +      Effectively, there won't be XA-prepare event group replay.
> +    */
> +    xa_recovery_member* member;
> +    if ((member= (xa_recovery_member *)
> +         my_hash_search(hash_arg, xid_arg->key(), xid_arg->key_length())))
> +    {
> +      if (member->state == XA_PREPARE)
> +        member->state= XA_COMPLETE;
> +    }
> +    else // We found only XA COMMIT during recovery insert to list
> +    {
> +      if (!(member= xa_member_insert(hash_arg,
> +                                     xid_arg, XA_COMPLETE, ptr_mem_root)))
> +        return true;
> +    }
> +  }
> +  return false;
> +}
> +#endif
>  
> +extern "C" uchar *xid_get_var_key(xid_t *entry, size_t *length,

define the first argument as const char*, then you won't need to cast in
my_hash_init, and you can be sure you pass the correct function with the
correct number of arguments, even if my_hash_get_key changes in the future.

and cast entry to xid_t* below explicitly.

> +                              my_bool not_used __attribute__((unused)))
> +{
> +  *length= entry->key_length();
> +  return (uchar*) entry->key();
> +}
> +
> +/**
> +   Performs recovery based on transaction coordinator log for 2pc. At the
> +   time of crash, if the binary log was in active state, then recovery for
> +   "implicit" 'xid's and explicit 'XA' transactions is initiated,
> +   otherwise merely the gtid binlog state is updated.
> +   For 'xid' and 'XA' based recovery the following steps are performed.
> +
> +   Identify the active binlog checkpoint file.
> +   Scan the binary log from the beginning.
> +   From GTID_LIST and GTID_EVENTs reconstruct the gtid binlog state.
> +   Prepare a list of 'xid's for recovery.
> +   Prepare a list of explicit 'XA' transactions for recovery.
> +   Recover the 'xid' transactions.
> +   The explicit 'XA' transaction recovery is initiated once all the server
> +   components are initialized. Please check 'execute_xa_for_recovery()'.

why explicit XA recovery is delayed?

> +
> +   Called from @c MYSQL_BIN_LOG::do_binlog_recovery()
> +
> +   @param linfo          Store here the found log file name and position to
> +                         the NEXT log file name in the index file.
> +
> +   @param last_log_name  Name of the last active binary log at the time of
> +                         crash.
> +
> +   @param first_log      Pointer to IO_CACHE of active binary log
> +   @param fdle           Format_description_log_event of active binary log
> +   @param do_xa          Is 2pc recovery needed for 'xid's and explicit XA
> +                         transactions.
> +   @return               indicates success or failure of recovery.
> +    @retval 0 success
> +    @retval 1 failure
> +
> +*/
>  int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
>                             IO_CACHE *first_log,
>                             Format_description_log_event *fdle, bool do_xa)
>  {
>    Log_event *ev= NULL;
>    HASH xids;
> -  MEM_ROOT mem_root;
>    char binlog_checkpoint_name[FN_REFLEN];
>    bool binlog_checkpoint_found;
>    bool first_round;
> @@ -10533,10 +10693,22 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
>  
>    if (do_xa)
>    {
> -    if (ha_recover(&xids))
> +    if (ha_recover(&xids, &xa_recover_list, &xa_recover_htons))
>        goto err2;
>  
> -    free_root(&mem_root, MYF(0));
> +    DBUG_ASSERT(!xa_recover_list.records ||
> +                (binlog_checkpoint_found && binlog_checkpoint_name[0] != 0));

why is that?

> +
> +    if (!xa_recover_list.records)
> +    {
> +      free_root(&mem_root, MYF(0));
> +      my_hash_free(&xa_recover_list);
> +    }
> +    else
> +    {
> +      xa_binlog_checkpoint_name= strmake_root(&mem_root, binlog_checkpoint_name,
> +                                              strlen(binlog_checkpoint_name));
> +    }
>      my_hash_free(&xids);
>    }
>    return 0;
> @@ -10561,6 +10734,219 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
>    return 1;
>  }
>  
> +void MYSQL_BIN_LOG::execute_xa_for_recovery()
> +{
> +  if (xa_recover_list.records)
> +    (void) recover_explicit_xa_prepare();
> +  free_root(&mem_root, MYF(0));
> +  my_hash_free(&xa_recover_list);
> +};
> +
> +/**
> +   Performs recovery of user XA transactions.
> +   'xa_recover_list' contains the list of XA transactions to be recovered.
> +   with possible replaying replication event from the binary log.
> +
> +   @return        indicates success or failure of recovery.
> +   @retval        false success
> +   @retval        true  failure
> +
> +*/
> +bool MYSQL_BIN_LOG::recover_explicit_xa_prepare()

why is it bool, if you only use it in execute_xa_for_recovery()
and ignore the return value?

> +{
> +#ifndef HAVE_REPLICATION
> +  /* Can't be supported without replication applier built in. */
> +  return false;
> +#else
> +  bool err= true;
> +  int error=0;
> +  Relay_log_info *rli= NULL;
> +  rpl_group_info *rgi;
> +  THD *thd= new THD(0);  /* Needed by start_slave_threads */
> +  thd->thread_stack= (char*) &thd;
> +  thd->store_globals();
> +  thd->security_ctx->skip_grants();
> +  IO_CACHE log;
> +  const char *errmsg;
> +  File        file;
> +  bool enable_apply_event= false;
> +  Log_event *ev = 0;
> +  LOG_INFO linfo;
> +  int recover_xa_count= xa_recover_list.records;
> +  xa_recovery_member *member= NULL;
> +
> +  if (!(rli= thd->rli_fake= new Relay_log_info(FALSE, "Recovery")))
> +  {
> +    my_error(ER_OUTOFMEMORY, MYF(ME_FATAL), 1);
> +    goto err2;
> +  }
> +  rli->sql_driver_thd= thd;
> +  static LEX_CSTRING connection_name= { STRING_WITH_LEN("Recovery") };
> +  rli->mi= new Master_info(&connection_name, false);
> +  if (!(rgi= thd->rgi_fake))
> +    rgi= thd->rgi_fake= new rpl_group_info(rli);
> +  rgi->thd= thd;
> +  thd->system_thread_info.rpl_sql_info=
> +    new rpl_sql_thread_info(rli->mi->rpl_filter);
> +
> +  if (rli && !rli->relay_log.description_event_for_exec)
> +  {
> +    rli->relay_log.description_event_for_exec=
> +      new Format_description_log_event(4);
> +  }
> +  if (find_log_pos(&linfo, xa_binlog_checkpoint_name, 1))
> +  {
> +    sql_print_error("Binlog file '%s' not found in binlog index, needed "
> +                    "for recovery. Aborting.", xa_binlog_checkpoint_name);
> +    goto err2;
> +  }
> +
> +  tmp_disable_binlog(thd);
> +  thd->variables.pseudo_slave_mode= TRUE;
> +  for (;;)
> +  {
> +    if ((file= open_binlog(&log, linfo.log_file_name, &errmsg)) < 0)
> +    {
> +      sql_print_error("%s", errmsg);
> +      goto err1;
> +    }
> +    while (recover_xa_count > 0 &&
> +        (ev= Log_event::read_log_event(&log,
> +                                       rli->relay_log.description_event_for_exec,
> +                                       opt_master_verify_checksum)))
> +    {
> +      if (!ev->is_valid())
> +      {
> +        sql_print_error("Found invalid binlog query event %s"
> +                        " at %s:%llu; error %d %s", ev->get_type_str(),
> +                        linfo.log_file_name,
> +                        (ev->log_pos - ev->data_written));
> +        goto err1;
> +      }
> +      enum Log_event_type typ= ev->get_type_code();
> +      ev->thd= thd;
> +
> +      if (typ == FORMAT_DESCRIPTION_EVENT)
> +        enable_apply_event= true;
> +
> +      if (typ == GTID_EVENT)
> +      {
> +        Gtid_log_event *gev= (Gtid_log_event *)ev;
> +        if (gev->flags2 &
> +            (Gtid_log_event::FL_PREPARED_XA | Gtid_log_event::FL_COMPLETED_XA))
> +        {
> +          if ((member=
> +               (xa_recovery_member*) my_hash_search(&xa_recover_list,
> +                                                    gev->xid.key(),
> +                                                    gev->xid.key_length())))
> +          {
> +            /*
> +              When XA PREPARE group of events (as flagged so) check
> +              its actual binlog state which may be COMPLETED. If the
> +              state is also PREPARED then analyze through
> +              in_engine_prepare whether the transaction needs replay.
> +             */
> +            if (gev->flags2 & Gtid_log_event::FL_PREPARED_XA)
> +            {
> +              if (member->state == XA_PREPARE)
> +              {
> +                // XA prepared is not present in (some) engine then apply it
> +                if (member->in_engine_prepare == 0)
> +                  enable_apply_event= true;
> +                else if (gev->flags2 & Gtid_log_event::FL_MULTI_ENGINE_XA &&
> +                         xa_recover_htons > member->in_engine_prepare)

This needs a comment, something like "FL_MULTI_ENGINE_XA says the transaction
must be prepared in more than one engine. We don't know in how many, so if it's
not prepared in *all* engines we'll replay it just in case"

I presume that (a bit silly) logic will do away when you will port this
patch to use a counter instead of a flag. Please, don't push if before
that.

> +                {
> +                  enable_apply_event= true;
> +                  // partially engine-prepared XA is first cleaned out prior replay
> +                  thd->lex->sql_command= SQLCOM_XA_ROLLBACK;
> +                  ha_commit_or_rollback_by_xid(&gev->xid, 0);
> +                }
> +                else
> +                  --recover_xa_count;
> +              }
> +            }
> +            else if (gev->flags2 & Gtid_log_event::FL_COMPLETED_XA)
> +            {
> +              if (member->state == XA_COMPLETE &&
> +                  member->in_engine_prepare > 0)
> +                enable_apply_event= true;

why? you cannot replay a fully prepared and partially committed transaction

> +              else
> +                --recover_xa_count;
> +            }
> +          }
> +        }
> +      }
> +
> +      if (enable_apply_event)
> +      {
> +        if ((err= ev->apply_event(rgi)))
> +        {
> +            sql_print_error("Failed to execute binlog query event of type: %s,"
> +                            " at %s:%llu; error %d %s", ev->get_type_str(),
> +                            linfo.log_file_name,
> +                            (ev->log_pos - ev->data_written),
> +                            thd->get_stmt_da()->sql_errno(),
> +                            thd->get_stmt_da()->message());
> +            delete ev;
> +            goto err1;
> +        }
> +        else if (typ == FORMAT_DESCRIPTION_EVENT)
> +          enable_apply_event=false;

how can that happen?

> +        else if (thd->lex->sql_command == SQLCOM_XA_PREPARE ||
> +                 thd->lex->sql_command == SQLCOM_XA_COMMIT  ||
> +                 thd->lex->sql_command == SQLCOM_XA_ROLLBACK)
> +        {
> +          --recover_xa_count;
> +          enable_apply_event=false;
> +
> +          sql_print_information("Binlog event %s at %s:%llu"
> +              " successfully applied",
> +              typ == XA_PREPARE_LOG_EVENT ?
> +              static_cast<XA_prepare_log_event *>(ev)->get_query() :
> +              static_cast<Query_log_event *>(ev)->query,
> +              linfo.log_file_name, (ev->log_pos - ev->data_written));
> +        }
> +      }
> +      if (typ != FORMAT_DESCRIPTION_EVENT)
> +        delete ev;
> +    }
> +    end_io_cache(&log);
> +    mysql_file_close(file, MYF(MY_WME));
> +    file= -1;
> +    if (unlikely((error= find_next_log(&linfo, 1))))
> +    {
> +      if (error != LOG_INFO_EOF)
> +        sql_print_error("find_log_pos() failed (error: %d)", error);
> +      else
> +        break;
> +    }
> +  }
> +err1:
> +  reenable_binlog(thd);
> +  /*
> +    There should be no more XA transactions to recover upon successful
> +    completion.
> +  */
> +  if (recover_xa_count > 0)
> +    goto err2;
> +  sql_print_information("Crash recovery finished.");
> +  err= false;
> +err2:
> +  if (file >= 0)
> +  {
> +    end_io_cache(&log);
> +    mysql_file_close(file, MYF(MY_WME));
> +  }
> +  thd->variables.pseudo_slave_mode= FALSE;
> +  delete rli->mi;
> +  delete thd->system_thread_info.rpl_sql_info;
> +  rgi->slave_close_thread_tables(thd);
> +  thd->reset_globals();
> +  delete thd;
> +
> +  return err;
> +#endif  /* !HAVE_REPLICATION */
> +}
>  
>  int
>  MYSQL_BIN_LOG::do_binlog_recovery(const char *opt_name, bool do_xa_recovery)

Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx