maria-developers team mailing list archive

Re: Re: Re: Re: Re: MDEV-520: consider parallel replication patch from taobao patches

 

Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:

> I will continue and look deeper in the rpl_deadlock_innodb failure and in the
> other issues.

Ok, I debugged the problem in rpl.rpl_deadlock_innodb where I get this
failure:

CURRENT_TEST: rpl.rpl_deadlock_innodb
mysqltest: In included file "./include/wait_for_slave_param.inc": 
included from ./include/wait_for_slave_sql_error.inc at line 41:
included from ./extra/rpl_tests/rpl_deadlock.test at line 84:
included from /home/knielsen/my/10.0/work-10.0-mdev520/mysql-test/suite/rpl/t/rpl_deadlock_innodb.test at line 6:
At line 115: Timeout in include/wait_for_slave_param.inc

This test case tries to replicate a transaction, but the slave is blocked by
row locks held by a user transaction. So the slave transaction gets a "Lock
wait timeout exceeded" error and retries the transaction; this repeats until
@@global.slave_transaction_retries is exceeded.

The test case waits for the maximum number of retries to happen and the slave
to stop with an error. However, this does not happen in your
parallel-replication tree. Instead the slave loops endlessly retrying (and
timing out) the transaction.

The transaction is retried the correct number of times, and then an error is
returned from execute_single_transaction(). But somehow this error is not
caught correctly, and the slave is not stopped.

Instead, execute_single_transaction() gets called again with the *same*
transaction, and it fails again, and so on endlessly, until the test case
itself times out and gives up waiting for the slave to stop with an error.

I have not yet found exactly where the error check is missing, but it must be
somewhere up in the call chain of execute_single_transaction(). Something
there needs to catch the error, stop the slave, and set the error code and
message for SHOW SLAVE STATUS. I hope you can sort it out from there;
otherwise, ask again.
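
To make the expected behaviour concrete, here is a minimal sketch in Python
(not the server code). All names in it (execute_with_retries, apply_all, the
status dict) are hypothetical, and counting one initial attempt plus
max_retries retries is an assumption. The point is only that the caller must
check the result and stop, rather than apply the same transaction again:

```python
def execute_with_retries(apply_txn, max_retries):
    """Apply one transaction, retrying when it fails.

    apply_txn is a hypothetical callable returning None on success or an
    error message on failure. Assumes one initial attempt plus up to
    max_retries retries. Returns None on success, else the last error.
    """
    error = None
    for _attempt in range(max_retries + 1):
        error = apply_txn()
        if error is None:
            return None
    return error


def apply_all(transactions, max_retries, status):
    """Apply transactions in order; stop at the first one that gives up.

    status is a hypothetical stand-in for what SHOW SLAVE STATUS reports.
    Returns True if every transaction was applied, False if we stopped.
    """
    for apply_txn in transactions:
        error = execute_with_retries(apply_txn, max_retries)
        if error is not None:
            # This is the check that appears to be missing: record the
            # error and stop, instead of looping on the same transaction.
            status["last_error"] = error
            status["running"] = False
            return False
    return True
```

With a transaction that always times out and max_retries = 3, this makes four
attempts and then stops with the error recorded, which is what the test case
waits for.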

By the way, while debugging I found something else that may also be an
error. I was replicating three CREATE TABLE statements in sequence:

    CREATE TABLE t1 (a INT NOT NULL, KEY(a)) ENGINE=InnoDB;
    CREATE TABLE t2 (a INT) ENGINE=InnoDB;
    CREATE TABLE t3 (a INT NOT NULL, KEY(a)) ENGINE=InnoDB;

It looks as if those are executed as a single transaction (a single call to
execute_single_transaction()). Is this on purpose? My guess is your code may
not correctly handle event groups that are not bracketed by BEGIN
... END. Basically, if there is no BEGIN ... END, then each event is an event
group by itself. The exceptions are the following events, which form a group
with the following event(s) and do not constitute a group by themselves:
INTVAR_EVENT, RAND_EVENT, USER_VAR_EVENT, TABLE_MAP_EVENT,
ANNOTATE_ROWS_EVENT.
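
The grouping rule can be sketched like this in Python. It is a simplified
illustration, not the server logic: events are plain strings, "BEGIN" and
"END" are symbolic markers for the transaction brackets, and the function name
split_event_groups is hypothetical:

```python
# Events that never form a group alone; they attach to what follows.
PREFIX_EVENTS = {
    "INTVAR_EVENT",
    "RAND_EVENT",
    "USER_VAR_EVENT",
    "TABLE_MAP_EVENT",
    "ANNOTATE_ROWS_EVENT",
}


def split_event_groups(events):
    """Split a flat list of event names into event groups."""
    groups = []
    current = []
    in_transaction = False
    for ev in events:
        current.append(ev)
        if ev == "BEGIN":
            in_transaction = True
        elif ev == "END":
            in_transaction = False
        # Outside BEGIN ... END, any event that is not a prefix event
        # completes the current group (END itself completes its group).
        if not in_transaction and ev not in PREFIX_EVENTS:
            groups.append(current)
            current = []
    if current:
        # Trailing events whose group has not been completed yet.
        groups.append(current)
    return groups
```

Under this rule the three CREATE TABLE statements come out as three separate
groups, while something like INTVAR_EVENT followed by an INSERT stays together
as one group.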

----

The logic around event execution failure and retry and so on (in the normal
slave code) is quite tricky, and it seems likely that there will be other
issues to deal with :-/. Hopefully the above can get you a bit further.

For the long term, I will try to get hold of Monty and discuss with him how to
improve the slave SQL thread code. We already have multiple SQL threads for
multi-slave, now your patch has multiple threads for your parallel
replication, and we may get even more threads for other features. I am hoping
we could do a general refactoring to properly support multiple threads,
where all of the event apply and error handling code can be cleaned up and
re-used by all the different features. Then your job will become a bit easier.

 - Kristian.

