← Back to team overview

maria-developers team mailing list archive

Re: [Commits] Rev 3879: MDEV-6593 : domain_id based replication filters in lp:~maria-captains/maria/maria-10.0-galera

 

Nirbhay Choubey <nirbhay@xxxxxxxxxxx> writes:

> [NC] You are right, setting both DO_ and IGNORE_ at the same time would not
> make any sense.
> But I actually followed the pattern of the existing replication, where both
> _do_ and _ignore_
> filters are allowed to hold values at the same time and _do_ is preferred
> in case both are set.

But if it does not make any sense, why allow it? Not giving an error will just
risk confuse users, there seems no benefit.

So I still think we need an error.

>>  - Add a test for the following scenario when using GTID mode: Some domain
>> id
>>    is ignored, some events are received from the master and ignored. Then
>> the
>>    slave is stopped, and the ignore rule is removed. What happens when the
>>    slave is restarted and reconnects? I think the previously ignored
>> domain id
>>    events will now be re-fetched and replicated. I guess that's ok, but it
>>    should be documented. Also test what happens when binlog files on the
>>    master have been rotated and purged, so that the events are no longer
>>    available on the master (an error should occur).
>>
>
> [NC] As the filtering takes place in the IO thread. I have added some test
> cases covering
> various scenarios including ones where IO thread crashes while reading
> COMMIT/XID event.
> In this scenario if the filtering is ON the GTID is added to the ignored
> list (ign_gtids) on GTID
> event, so when the crash happens before COMMIT/XID event, that particular
> GTID gets ignored
> after the slave reconnects.

Ok, I'd forgotten about the ign_gtids.

So if I understand correctly, this means that when we ignore a transaction due
to filtered domain, the IO thread logs a GTID_LIST event to the relay log, and
the SQL thread then updates the slave position to after the ignored events.

So this means that an ignored event stays ignored, even if the domain id
filter is later removed, that is good.

Crash tests are always good and needed. However, the test I was asking for is
what happens when the configuration is changed:

CHANGE MASTER TO ignore_domain_ids=(2), master_use_gtid=current_pos;
START SLAVE;
# Replicate a number of events in domain 2.
STOP SLAVE;
CHANGE MASTER TO ignore_domain_ids=();
START SLAVE;
# Replicate some more events in domain 2 from master.
# Test that the events received before the filter was removed are still
# ignored, while those received after were applied.

>> > @@ -5601,6 +5610,10 @@
>> >      mi->last_queued_gtid= event_gtid;
>> >      mi->last_queued_gtid_standalone=
>> >        (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
>> > +
>> > +    /* Should filter all the subsequent events in the current GTID
>> group? */
>> > +    mi->domain_id_filter.set_filter(event_gtid.domain_id);
>>
>> So, here we will set m_filter to true, if we want to skip the following
>> group.
>>
>> But I don't see any code that will set it to false at the end of the
>> group?
>
>
> [NC] With the previous logic m_filter was set/reset only at GTID event. I
> have now added the logic
> to reset this flag on COMMIT/XID events too.

I guess you mean this code?

    if ((mi->using_gtid != Master_info::USE_GTID_NO) &&
        mi->domain_id_filter.is_group_filtered() &&
        is_commit_rollback)
    {
      mi->domain_id_filter.reset_filter();
    }

But that is not sufficient. This is only valid for GTIDs that have the
FL_STANDALONE flag clear. For GTIDs with FL_STANDALONE set, the event groups
ends after then first event for which Log_event::is_part_of_group() returns
false.

(There are two kinds of event groups. Some use BEGIN ... COMMIT or BEGIN
... XID; Others have just a single statement, possibly with INTVAR or so
events before, like CREATE TABLE ...).

>
>
>> That seems a problem, as we could then end up wrongly skipping all kinds of
>> other events, such as Gtid_list, Incident, ...
>>
>
> [NC] Right. I have added the following events types in the exception list :
> FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT,
> HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.

Hm, that seems quite error prone if a new event is added later and not added
to the list.

Can you use the Log_event::is_group_event() function to check? That function
is more likely to not be forgotten, and it seems all of its events should not
be ignored in case of domain id filter?

>> Also it is possible for eg. a master downgrade to introduce following event
>> groups that do not start with GTID - seems these would also be wrongly
>> skipped? So we also need a test case for this...
>>
>
> [NC] Right, its a valid case, but I couldn't find a way to repeat this
> scenario in mtr. Suggestions?

I guess like your test in rpl_domain_id_filter_old_master.test, where you copy
an old binlog file into the master server. To test this, just copy in a
master-bin.000002 with old events, when the last event in master-bin.000001 is
skipped by the slave due to domain filters.

> > For example, it seems to me that in non-GTID mode, if we restart the
> > server in
> > the middle of receiving an event group, we can easily end up with ignoring
> > one
> > half of the group and not the other, which is very bad?
> >
> 
> The filtering would only work when slave is configured with GTID.

What do you mean "only work"? That trying to use domain id filters in non-gtid
mode will give an error? Or that the filter will be ignored? Or that it will
just malfunction?

I think it is reasonable to say that domain filtering is only available in
GTID mode.

But then it needs to be an error if user tries to use the domain filter in
non-gtid mode.

Thus:

    CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=(1)

should give an error. And setting just master_use_gtid=no when there is a
filter, or just setting a filter when master_use_gtid=no, both should give
errors as well.

Right?

And now I saw a commit "Added some more test cases to non-GTID mode." which
seems to add test cases to test domain id filters precisely when
MASTER_USE_GTID=no? So now I am rather confused as to what you mean with "only
work when slave is configured with GTID" ?

> > For non-GTID mode, maybe we need to handle the ignoring in the SQL thread
> > instead? Or alternatively, we can make ignoring events based on domain_id
> > only
> > legal in GTID mode, and give an error in non-GTID?
> >
> 
> I think the patch follows the 2nd approach minus the error. Could you
> please elaborate
> on the error?

If domain id filters are not available in non-GTID mode (master_use_gtid=no),
then the server needs to give an error if the user tries to configure CHANGE
MASTER to have domain id filters in non-gtid mode. Otherwise the user could be
unpleasantly surprised.

> > Why do you need to use ulong? Domain ids should always be uint32, also on
> > 64-bit.
> 
> > What is the real problem? Eg. why is there a test failure?
> 
> It was due to the different sizes for ulong on different platforms. While
> trying to reuse
> the (modified) auxiliary functions for IGNORE_SERVER_IDS where ulong is
> used for
> ids (server_id), uint32 store for domain_id did not work on 64-bit.

Ok, I see, this is to re-use change_master_id_cmp() between server id filter
and domain id filter. For some reason the code for server_id filter is using
ulong, that is what surprised me. Well, I suppose it doesn't matter much one
way or the other.


> ##### Case 7: Stop slave into the middle of a transaction being filtered and
> #             start it back with filtering disabled.
> 
> --echo # On master
> connection master;
> SET @@session.gtid_domain_id=1;
> BEGIN;
> INSERT INTO t2 VALUES(3);
> INSERT INTO t3 VALUES(3);
> sync_slave_with_master;

No, this does not work. Transactions are always binlogged as a whole on the
master, during COMMIT.

So to reliably test the slave I/O thread disconnecting in the middle of an
event group, you need to use some DBUG error injection. You can see an example
of doing this in rpl_gtid_reconnect.test, though here we may need a slightly
different DBUG injection, as we need to reconfigure the slave in the middle of
the group, not trigger an automatic reconnect.

But wouldn't it just be better to give an error if the user tries to use
domain id filters with MASTER_USE_GTID=NO ? It could be tricky to implement
correctly, and it doesn't seem an important use case?

In GTID mode, things are a bit simpler - because one can not change the
filters without stopping both SQL and IO thread. And in GTID mode, the I/O
thread deletes the old relay logs reconnecting after both slave threads have
been stopped, thus there is no opportunity to filter only half of an event
group when changing domain id filters.

 - Kristian.


Follow ups

References