maria-developers team mailing list archive

Re: [Commits] Rev 3879: MDEV-6593 : domain_id based replication filters in lp:~maria-captains/maria/maria-10.0-galera

 

Nirbhay Choubey <nirbhay@xxxxxxxxxxx> writes:

>> > ##### Case 7: Stop slave into the middle of a transaction being filtered and
>> > #             start it back with filtering disabled.
>> >
>> > --echo # On master
>> > connection master;
>> > SET @@session.gtid_domain_id=1;
>> > BEGIN;
>> > INSERT INTO t2 VALUES(3);
>> > INSERT INTO t3 VALUES(3);
>> > sync_slave_with_master;
>>
>> No, this does not work. Transactions are always binlogged as a whole on the
>> master, during COMMIT.
>>
>
> You are right. My original intent was to test a transaction which modifies
> both MyISAM and InnoDB tables, where the first modification is done in the
> MyISAM table. In that case the changes to MyISAM are sent to the slave right
> away, while the rest of the trx is sent on commit. I have modified the test
> accordingly.

I'm still not sure you understand the scenario I had in mind. It's not about
what happens on the master during the transaction. It is about what happens in
case the slave disconnects in the middle of receiving an event
group/transaction.

In replication in general, the major part of the work is not implementing the
functionality for the normal case - that is usually relatively easy. The major
part is handling and testing all the special cases that can occur, especially
the various error cases. The replication code is really complex in this
respect, and the fact that things by their nature happen in parallel between
different threads and different servers makes it even more complex.

What I wanted you to think about here is what happens if the slave is
disconnected from the master after having received the first half of an event
group, for example due to a network error. This will not normally happen in a
mysql-test-case run, and if it happens at a production site for a user, it
will be extremely hard to track down.

In this case, the second half of the event group could be received much later
than the first half. The IO thread could have been stopped (or even the whole
mysqld server could have been stopped) in between, and the replication could
have been re-configured with CHANGE MASTER. Since the IO thread is doing the
filtering, it seems very important to consider what will happen if e.g. filters
are enabled while receiving the first half of the transaction, but disabled
while receiving the second half:

Suppose we have this transaction:

  BEGIN GTID 2-1-100
  INSERT INTO t1 VALUES (1);
  INSERT INTO t1 VALUES (2);
  COMMIT;

What happens in the following scenario?

  CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
  START SLAVE;
  # slave IO thread connects to master;
  # slave receives: BEGIN GTID 2-1-100; INSERT INTO t1 VALUES (1);
  # slave IO thread is disconnected from master
  STOP SLAVE;
  # slave mysqld process is stopped and restarted.
  CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
  START SLAVE;
  # slave IO thread connects to master;
  # slave IO thread receives: INSERT INTO t1 VALUES (2); COMMIT;

Are you sure that this will work correctly? And what does "work correctly"
mean in this case? Will the transaction be completely ignored? Or will it be
completely replicated on the slave? The bug would be if the first half were
ignored, but the second half were still written into the relay log.
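
To make the failure mode concrete, here is a small toy model in Python. It is
not MariaDB code, and the class and function names are made up for
illustration. It assumes an IO thread that decides whether to filter when it
sees the GTID event of a group, and compares what ends up in the relay log
when that decision is lost across the reconnect versus when it is kept:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    kind: str                        # "GTID", "QUERY" or "COMMIT"
    domain_id: Optional[int] = None  # only set on the GTID event
    text: str = ""


class NaiveIOThread:
    """Hypothetical IO thread: filter decision is forgotten on reconnect."""

    def __init__(self, ignore_domain_ids):
        self.ignore = set(ignore_domain_ids)
        self.skipping = False
        self.relay_log = []

    def reconnect(self, ignore_domain_ids):
        # Models STOP SLAVE / restart / CHANGE MASTER: in-memory state is lost.
        self.ignore = set(ignore_domain_ids)
        self.skipping = False

    def receive(self, ev):
        if ev.kind == "GTID":
            self.skipping = ev.domain_id in self.ignore
        if not self.skipping:
            self.relay_log.append(ev)
        if ev.kind == "COMMIT":
            self.skipping = False


class GroupAwareIOThread(NaiveIOThread):
    """Assumed alternative: the decision for an in-flight group survives."""

    def reconnect(self, ignore_domain_ids):
        self.ignore = set(ignore_domain_ids)  # self.skipping is kept as-is


def has_partial_group(relay_log):
    """True if the log holds a group without its start or without its end."""
    in_group = False
    for ev in relay_log:
        if ev.kind == "GTID":
            in_group = True
        elif not in_group:
            return True
        if ev.kind == "COMMIT":
            in_group = False
    return in_group


def run_scenario(io_thread_cls):
    group = [
        Event("GTID", 2, "BEGIN GTID 2-1-100"),
        Event("QUERY", text="INSERT INTO t1 VALUES (1)"),
        Event("QUERY", text="INSERT INTO t1 VALUES (2)"),
        Event("COMMIT", text="COMMIT"),
    ]
    io = io_thread_cls(ignore_domain_ids=[2])
    for ev in group[:2]:          # first half, filters enabled
        io.receive(ev)
    io.reconnect(ignore_domain_ids=[])
    for ev in group[2:]:          # second half, filters disabled
        io.receive(ev)
    return io.relay_log
```

With NaiveIOThread the relay log ends up holding only the second INSERT and
the COMMIT, which is exactly the half-filtered group described above. With
GroupAwareIOThread the whole group is dropped. Keeping the decision is only
one possible definition of "work correctly"; replicating the group completely
would be another.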

To test this, you would need to use DBUG error insertion. There are already
some tests that do this. They use for example

  SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";

The code will then (in debug builds) simulate a disconnect at some particular
point in the replication stream, allowing this rare but important case to be
tested. This is done using DBUG_EXECUTE_IF() in the code.
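
For readers who have not seen this style of error insertion, the idea can be
sketched in a few lines of Python. This is only an analogy to
DBUG_EXECUTE_IF(), not the server's implementation; the keyword
toy_disconnect_after_2_events and all function names are invented for the
example, and the control string handling is heavily simplified:

```python
ENABLED_KEYWORDS = set()


def set_debug(control):
    """Handle a simplified '+d,keyword' or '-d,keyword' control string."""
    op, _, keyword = control.partition(",")
    if op == "+d":
        ENABLED_KEYWORDS.add(keyword)
    elif op == "-d":
        ENABLED_KEYWORDS.discard(keyword)
    else:
        raise ValueError("unsupported control string: " + control)


def execute_if(keyword, action):
    """Run action only when keyword is enabled, like a debug-only hook."""
    if keyword in ENABLED_KEYWORDS:
        action()


class Disconnect(Exception):
    """Stands in for a network error between master and slave."""


def _drop_connection():
    raise Disconnect("simulated network error")


def receive_events(events):
    """Return the events received before any injected disconnect."""
    received = []
    try:
        for ev in events:
            # Injection point: fires after two events, if the keyword is on.
            if len(received) == 2:
                execute_if("toy_disconnect_after_2_events", _drop_connection)
            received.append(ev)
    except Disconnect:
        pass
    return received
```

With the keyword disabled, receive_events() returns all events. After
set_debug("+d,toy_disconnect_after_2_events") it returns only the first two,
which gives a test a deterministic way to stop in the middle of an event
group.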

To work on replication without introducing nasty bugs, it is important to
think through cases like this carefully, and to convince yourself that things
will work correctly. That includes disconnects at various points, crashes on
the master or slave, errors while applying events or writing to the relay
logs, and so on.

Hope this helps,

 - Kristian.

