
maria-developers team mailing list archive

Re: [Commits] Rev 3879: MDEV-6593 : domain_id based replication filters in lp:~maria-captains/maria/maria-10.0-galera

 

Hi!

On Fri, Nov 14, 2014 at 3:58 AM, Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
wrote:

> Nirbhay Choubey <nirbhay@xxxxxxxxxxx> writes:
>
> >> > ##### Case 7: Stop slave into the middle of a transaction being filtered and
> >> > #             start it back with filtering disabled.
> >> >
> >> > --echo # On master
> >> > connection master;
> >> > SET @@session.gtid_domain_id=1;
> >> > BEGIN;
> >> > INSERT INTO t2 VALUES(3);
> >> > INSERT INTO t3 VALUES(3);
> >> > sync_slave_with_master;
> >>
> >> No, this does not work. Transactions are always binlogged as a whole on
> >> the master, during COMMIT.
> >>
> >
> > You are right. My original intent was to test a transaction which modifies
> > both MyISAM and InnoDB tables, where first modification is done in MyISAM
> > table. In which case the changes to MyISAM is sent to the slave right away,
> > while rest of trx is sent on commit. I have modified the test accordingly.
>
> I'm still not sure you understand the scenario I had in mind. It's not about
> what happens on the master during the transaction. It is about what happens in
> case the slave disconnects in the middle of receiving an event
> group/transaction.
>

You are perhaps looking at an older version of the test. The latest version
says:

<cut>
##### Case 7: Stop slave before a transaction (involving MyISAM and InnoDB
#             table) being filtered commits and start it back with filtering
#             disabled.
...
</cut>


> In general in replication, the major part of the work is not implementing the
> functionality for the normal case - that is usually relatively easy. The major
> part is handling and testing all the special cases that can occur in special
> scenarios, especially various error cases. The replication code is really
> complex in this respect, and the fact that things by their nature happen in
> parallel between different threads and different servers make things even more
> complex.
>
> What I wanted you to think about here is what happens if the slave is
> disconnected from the master after having received the first half of an event
> group. For example due to network error. This will not happen normally in a
> mysql-test-case run, and if it happens in a production site for a user, it
> will be extremely hard to track down.
>
> In this case, the second half of the event group could be received much later
> than the first half. The IO thread could have been stopped (or even the whole
> mysqld server could have been stopped) in-between, and the replication could
> have been re-configured with CHANGE MASTER. Since the IO thread is doing the
> filtering, it seems very important to consider what will happen if eg. filters
> are enabled while receiving the first half of the transaction, but disabled
> while receiving the second half:


> Suppose we have this transaction:
>
>   BEGIN GTID 2-1-100
>   INSERT INTO t1 VALUES (1);
>   INSERT INTO t1 VALUES (2);
>   COMMIT;
>
> What happens in the following scenario?
>
>   CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
>   START SLAVE;
>   # slave IO thread connects to master;
>   # slave receives: BEGIN GTID 2-1-100; INSERT INTO t1 VALUES (1);
>   # slave IO thread is disconnected from master
>   STOP SLAVE;
>   # slave mysqld process is stopped and restarted.
>   CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
>   START SLAVE;
>   # slave IO thread connects to master;
>   # slave IO thread receives: INSERT INTO t1 VALUES (2); COMMIT;
>
> Are you sure that this will work correctly? And what does "work correctly"
> mean in this case? Will the transaction be completely ignored? Or will it be
> completely replicated on the slave? The bug would be if the first half would
> be ignored, but the second half still written into the relay log.
>
> To test this, you would need to use DBUG error insertion. There are already
> some tests that do this. They use for example
>
>   SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
>
> The code will then (in debug builds) simulate a disconnect at some particular
> point in the replication stream, allowing this rare but important case to be
> tested. This is done using DBUG_EXECUTE_IF() in the code.
>

I had already added multiple cases to rpl_domain_id_filter_io_crash.test in
the previous commit, using DBUG_EXECUTE_IF() with the debug keyword
"+d,kill_io_slave_before_commit". Even though it is not exactly what you
suggest, it does try to kill the I/O thread when it receives the COMMIT/XID
event (cases 0 - 3), in order to test what happens when the I/O thread exits
before reading the complete transaction or group, with filtering enabled
before/after the slave restart.

Following your suggestion, I have now added 2 more cases (4 and 5) using
DBUG_EXECUTE_IF() with "+d,kill_slave_io_after_2_events" to kill the I/O
thread after it reads the first INSERT in a transaction. The outcome is as
expected.
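
The property that these mid-transaction cases exercise can be sketched with a
toy Python model. This is only an illustration under stated assumptions, not
the server code: `Event`, `ToyIOThread`, the `sticky` flag, `has_torn_group`
and `run_scenario` are hypothetical names, and the model simply assumes that
the skip/keep decision taken at the GTID event survives the reconnect (in a
real server that state would have to be kept somewhere across a restart).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Event:
    kind: str                        # "GTID", "QUERY" or "COMMIT"
    domain_id: Optional[int] = None  # only meaningful on GTID events
    sql: str = ""


class ToyIOThread:
    """Hypothetical model of an I/O thread filtering on GTID domain id."""

    def __init__(self, sticky: bool) -> None:
        # sticky=True: decide once per event group, at its GTID event.
        # sticky=False: re-check the current filter list for every event.
        self.sticky = sticky
        self.ignore_domain_ids: Set[int] = set()
        self.relay_log: List[Event] = []
        self.group_domain: Optional[int] = None
        self.group_skipped: Optional[bool] = None

    def change_master(self, ignore_domain_ids: Iterable[int]) -> None:
        self.ignore_domain_ids = set(ignore_domain_ids)

    def receive(self, ev: Event) -> None:
        if ev.kind == "GTID":
            self.group_domain = ev.domain_id
            self.group_skipped = ev.domain_id in self.ignore_domain_ids
        if self.sticky:
            skip = self.group_skipped
        else:
            skip = self.group_domain in self.ignore_domain_ids
        if not skip:
            self.relay_log.append(ev)
        if ev.kind == "COMMIT":
            self.group_domain = None
            self.group_skipped = None


def has_torn_group(relay_log: List[Event]) -> bool:
    """True if the log holds an event group missing its start or its end."""
    in_group = False
    for ev in relay_log:
        if ev.kind == "GTID":
            if in_group:
                return True
            in_group = True
        elif not in_group:
            return True
        elif ev.kind == "COMMIT":
            in_group = False
    return in_group


def run_scenario(sticky: bool) -> List[Event]:
    io = ToyIOThread(sticky)
    io.change_master(ignore_domain_ids=[2])
    io.receive(Event("GTID", domain_id=2))
    io.receive(Event("QUERY", sql="INSERT INTO t1 VALUES (1)"))
    # Disconnect here; filtering is switched off before reconnecting.
    io.change_master(ignore_domain_ids=[])
    io.receive(Event("QUERY", sql="INSERT INTO t1 VALUES (2)"))
    io.receive(Event("COMMIT"))
    return io.relay_log


if __name__ == "__main__":
    for sticky in (True, False):
        log = run_scenario(sticky)
        print(sticky, [ev.kind for ev in log], has_torn_group(log))
```

In this model, `sticky=True` leaves the relay log empty (the whole
transaction is ignored), while `sticky=False` lets the second INSERT and the
COMMIT through without their GTID event, which is the torn group Kristian
describes as the bug.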


>
> To work on replication without introducing nasty bugs, it is important to
> think through cases like this carefully, and to convince yourself that things
> will work correctly. Disconnects at various points, crashes on the master or
> slave, errors during applying events or writing to the relay logs, and so on.
>

I agree.


>
> Hope this helps,
>

Indeed.

Best,
Nirbhay
