maria-discuss team mailing list archive

Re: Known limitation with TokuDB in Read Free Replication & parallel replication ?

 

Hello All,
I have been running sysbench oltp with a MariaDB 10.1 master-slave
topology.  I have not seen any replication errors when slave_parallel_mode
is conservative.

However, when I set slave_parallel_mode to optimistic and
slave_parallel_threads = 2, I get a lock timeout replication error with TokuDB.
Just before the lock timeout error fires (which requires a TokuDB lock
timeout to occur), I see one of the replication threads waiting for a
lock held by the other replication thread.  gdb shows the first thread
waiting on a lock inside TokuDB.  The other thread is stalled while
committing its transaction in wait_for_prior_commit_2 <-
wait_for_prior_commit <- THD::wait_for_prior_commit <-
TC_LOG_MMAP::log_and_order <- ha_commit_trans.

Is TokuDB supposed to call the thd_report_wait_for() API just before a
thread is about to wait on a TokuDB lock?
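
To make the stall concrete, here is a toy model of the two waits as a
wait-for graph.  This is only a sketch, not MariaDB or TokuDB code: the
names find_cycle, engine_waits and commit_order_waits are hypothetical, and
it assumes the server can only break the cycle if it knows about both edges.

```python
# Toy model of the stall described above; not MariaDB or TokuDB code.
# All names are hypothetical and chosen only for illustration.

def find_cycle(waits_for):
    """Return the list of transactions forming a wait-for cycle, or None."""
    for start in waits_for:
        path, node = [start], waits_for.get(start)
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None

# T1 and T2 are consecutive transactions applied by two worker threads.
# T2 took a row lock first, so T1 now waits for it inside the engine.
engine_waits = {"T1": "T2"}
# In-order commit: T2 may not commit before T1 (wait_for_prior_commit).
commit_order_waits = {"T2": "T1"}

# Assumption: if the engine never reports its lock wait, the server only
# knows the commit-order edge, sees no cycle, and the stall lasts until
# the engine's lock timeout fires.
server_view_unreported = dict(commit_order_waits)
# With the lock wait reported, both edges are visible and the cycle shows.
server_view_reported = {**commit_order_waits, **engine_waits}

print(find_cycle(server_view_unreported))
print(find_cycle(server_view_reported))
```

In this model the cycle only shows up in the second view, which is why I
suspect the missing report is what turns a resolvable conflict into a lock
timeout.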



On Sun, Aug 7, 2016 at 7:50 PM, jocelyn fournier <jocelyn.fournier@xxxxxxxxx
> wrote:

> Hi Kristian,
>
>
> Just FYI I confirm the "Lock wait timeout exceeded; try restarting
> transaction" behaviour you described.
>
> I've duplicated & modified the rpl_parallel_optimistic.test and run it
> into storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test :
>
> ./mtr --suite=tokudb_rpl
> Logging: ./mtr  --suite=tokudb_rpl
> vardir: /home/joce/mariadb-10.1.16/mysql-test/var
> Checking leftover processes...
> Removing old var directory...
> Creating var directory '/home/joce/mariadb-10.1.16/mysql-test/var'...
> Checking supported features...
> MariaDB Version 10.1.16-MariaDB-debug
>  - SSL connections supported
>  - binaries are debug compiled
> Using suites: tokudb_rpl
> Collecting tests...
> Installing system database...
> ==============================================================================
>
> TEST                                      RESULT   TIME (ms) or COMMENT
> --------------------------------------------------------------------------
>
> worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
> worker[1] mysql-test-run: WARNING: running this script as _root_ will
> cause some tests to be skipped
> tokudb_rpl.rpl_parallel_optimistic 'innodb_plugin,mix' [ fail ]
>         Test ended at 2016-08-08 01:26:34
>
> CURRENT_TEST: tokudb_rpl.rpl_parallel_optimistic
> mysqltest: In included file "./include/sync_with_master_gtid.inc":
> included from /home/joce/mariadb-10.1.16/storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test at line 59:
> At line 50: Failed to sync with master
>
> The result from queries just before the failure was:
> < snip >
> DELETE FROM t1 WHERE a=2;
> INSERT INTO t1 VALUES (2,5);
> DELETE FROM t1 WHERE a=3;
> INSERT INTO t1 VALUES(3,2);
> DELETE FROM t1 WHERE a=1;
> INSERT INTO t1 VALUES(1,2);
> DELETE FROM t1 WHERE a=3;
> INSERT INTO t1 VALUES(3,3);
> DELETE FROM t1 WHERE a=2;
> INSERT INTO t1 VALUES (2,6);
> include/save_master_gtid.inc
> SELECT * FROM t1 ORDER BY a;
> a    b
> 1    2
> 2    6
> 3    3
> include/start_slave.inc
> include/sync_with_master_gtid.inc
> Timeout in master_gtid_wait('0-1-20', 120), current slave GTID position
> is: 0-1-3.
> Slave state : Waiting for master to send event    127.0.0.1 root    16000
>   1    master-bin.000001    3468 slave-relay-bin.000002    796
> master-bin.000001    Yes    No                         1205    Lock wait
> timeout exceeded; try restarting transaction    0    772    3790    None
>     0 No                            No    0        1205    Lock wait
> timeout exceeded; try restarting transaction        1 Slave_Pos    0-1-20
>           optimistic
>
>
> I've no explanation so far for the DUPLICATE KEY error I've seen.
>
>
>   Jocelyn
>
>
> On 15/07/2016 at 17:09, Kristian Nielsen wrote:
>
>> jocelyn fournier <jocelyn.fournier@xxxxxxxxx> writes:
>>
>>> Thanks for the quick answer! I wonder if it would be possible to
>>> automatically disable the optimistic parallel replication for an
>>> engine if it does not implement it?
>>>
>> That would probably be good - though it would be better to just implement
>> the necessary API, it's a very small change (basically TokuDB just needs to
>> inform the upper layer of any lock waits that take place inside).
>>
>> However, looking more at your description, you got a "key not found"
>> error. Not implementing the thd_report_wait_for() could lead to deadlocks,
>> but it shouldn't cause key not found. In fact, in optimistic mode, all
>> errors are treated as "deadlock" errors, the query is rolled back, and
>> run again, this time not in parallel.
>>
>> So I'm wondering if there is something else going on. If transactions T1 and
>> T2 run in parallel, it's possible that they have a row conflict. But if T2
>> deleted a row expected by T1, I would expect T1 to wait on a row lock held
>> by T2, not get a duplicate key error. And if T1 has not yet inserted a row
>> expected by T2, then T2 would be rolled back and retried after T1 has
>> committed. The first can cause a deadlock, but neither case seems to cause
>> a duplicate key error.
>>
>> Maybe TokuDB is doing something special with locks around replication, or
>> something else goes wrong. I guess TokuDB just hasn't been tested much with
>> parallel replication.
>>
>> Does it work ok when running in conservative parallel mode?
>>
>>   - Kristian.
>>
>
>
