← Back to team overview

maria-developers team mailing list archive

Re: [Commits] Rev 3879: MDEV-6593 : domain_id based replication filters in lp:~maria-captains/maria/maria-10.0-galera

 

Hi!

On Mon, Oct 27, 2014 at 9:12 AM, Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
wrote:

> Nirbhay Choubey <nirbhay@xxxxxxxxxxx> writes:
>
> > [NC] You are right, setting both DO_ and IGNORE_ at the same time would
> not
> > make any sense.
> > But I actually followed the pattern of the existing replication, where
> both
> > _do_ and _ignore_
> > filters are allowed to hold values at the same time and _do_ is preferred
> > in case both are set.
>
> But if it does not make any sense, why allow it? Not giving an error will
> just
> risk confuse users, there seems no benefit.
>

Right.


>
> So I still think we need an error.
>

Ok. CHANGE MASTER would now fail when both DO_ and INGORE_ are non-empty
at the same time or when MASTER_USE_GTID=no with non-empty DO_ or IGNORE_.


> >>  - Add a test for the following scenario when using GTID mode: Some
> domain
> >> id
> >>    is ignored, some events are received from the master and ignored.
> Then
> >> the
> >>    slave is stopped, and the ignore rule is removed. What happens when
> the
> >>    slave is restarted and reconnects? I think the previously ignored
> >> domain id
> >>    events will now be re-fetched and replicated. I guess that's ok, but
> it
> >>    should be documented. Also test what happens when binlog files on the
> >>    master have been rotated and purged, so that the events are no longer
> >>    available on the master (an error should occur).
> >>
> >
> > [NC] As the filtering takes place in the IO thread. I have added some
> test
> > cases covering
> > various scenarios including ones where IO thread crashes while reading
> > COMMIT/XID event.
> > In this scenario if the filtering is ON the GTID is added to the ignored
> > list (ign_gtids) on GTID
> > event, so when the crash happens before COMMIT/XID event, that particular
> > GTID gets ignored
> > after the slave reconnects.
>
> Ok, I'd forgotten about the ign_gtids.
>
> So if I understand correctly, this means that when we ignore a transaction
> due
> to filtered domain, the IO thread logs a GTID_LIST event to the relay log,
> and
> the SQL thread then updates the slave position to after the ignored events.
>
> So this means that an ignored event stays ignored, even if the domain id
> filter is later removed, that is good.
>

Yes.


>
> Crash tests are always good and needed. However, the test I was asking for
> is
> what happens when the configuration is changed:
>
> CHANGE MASTER TO ignore_domain_ids=(2), master_use_gtid=current_pos;
> START SLAVE;
> # Replicate a number of events in domain 2.
> STOP SLAVE;
> CHANGE MASTER TO ignore_domain_ids=();
> START SLAVE;
> # Replicate some more events in domain 2 from master.
> # Test that the events received before the filter was removed are still
> # ignored, while those received after were applied.
>

I have added a test covering this scenario.


>
> >> > @@ -5601,6 +5610,10 @@
> >> >      mi->last_queued_gtid= event_gtid;
> >> >      mi->last_queued_gtid_standalone=
> >> >        (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
> >> > +
> >> > +    /* Should filter all the subsequent events in the current GTID
> >> group? */
> >> > +    mi->domain_id_filter.set_filter(event_gtid.domain_id);
> >>
> >> So, here we will set m_filter to true, if we want to skip the following
> >> group.
> >>
> >> But I don't see any code that will set it to false at the end of the
> >> group?
> >
> >
> > [NC] With the previous logic m_filter was set/reset only at GTID event. I
> > have now added the logic
> > to reset this flag on COMMIT/XID events too.
>
> I guess you mean this code?
>
>     if ((mi->using_gtid != Master_info::USE_GTID_NO) &&
>         mi->domain_id_filter.is_group_filtered() &&
>         is_commit_rollback)
>     {
>       mi->domain_id_filter.reset_filter();
>     }
>
> But that is not sufficient. This is only valid for GTIDs that have the
> FL_STANDALONE flag clear. For GTIDs with FL_STANDALONE set, the event
> groups
> ends after then first event for which Log_event::is_part_of_group() returns
> false.
>
> (There are two kinds of event groups. Some use BEGIN ... COMMIT or BEGIN
> ... XID; Others have just a single statement, possibly with INTVAR or so
> events before, like CREATE TABLE ...).
>

I see. Fixed.


>
> >
> >
> >> That seems a problem, as we could then end up wrongly skipping all
> kinds of
> >> other events, such as Gtid_list, Incident, ...
> >>
> >
> > [NC] Right. I have added the following events types in the exception
> list :
> > FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT,
> > HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.
>
> Hm, that seems quite error prone if a new event is added later and not
> added
> to the list.
>
> Can you use the Log_event::is_group_event() function to check? That
> function
> is more likely to not be forgotten, and it seems all of its events should
> not
> be ignored in case of domain id filter?
>

Right. Done!


>
> >> Also it is possible for eg. a master downgrade to introduce following
> event
> >> groups that do not start with GTID - seems these would also be wrongly
> >> skipped? So we also need a test case for this...
> >>
> >
> > [NC] Right, its a valid case, but I couldn't find a way to repeat this
> > scenario in mtr. Suggestions?
>
> I guess like your test in rpl_domain_id_filter_old_master.test, where you
> copy
> an old binlog file into the master server. To test this, just copy in a
> master-bin.000002 with old events, when the last event in
> master-bin.000001 is
> skipped by the slave due to domain filters.
>

Now that DO_ & IGNORE_ can''t be used with MASTER_USE_GTID=no, using
CURRENT_POS/SLAVE_POS causes this old binlog file to skip. So, I had to
remove this test case.


> > > For example, it seems to me that in non-GTID mode, if we restart the
> > > server in
> > > the middle of receiving an event group, we can easily end up with
> ignoring
> > > one
> > > half of the group and not the other, which is very bad?
> > >
> >
> > The filtering would only work when slave is configured with GTID.
>
> What do you mean "only work"? That trying to use domain id filters in
> non-gtid
> mode will give an error? Or that the filter will be ignored? Or that it
> will
> just malfunction?
>

It meant that in non-GTID mode, the domain_id filters with simply get
ignored.
But, following your suggestion, I have now modified the log to give error if
domin_id filter is set with MASTER_USE_GTID=no.


> I think it is reasonable to say that domain filtering is only available in
> GTID mode.
>

Right.


>
> But then it needs to be an error if user tries to use the domain filter in
> non-gtid mode.
>
> Thus:
>
>     CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=(1)
>
> should give an error. And setting just master_use_gtid=no when there is a
> filter, or just setting a filter when master_use_gtid=no, both should give
> errors as well.
>
> Right?
>

Yes. I have also added some tests covering various combinations.


>
> And now I saw a commit "Added some more test cases to non-GTID mode." which
> seems to add test cases to test domain id filters precisely when
> MASTER_USE_GTID=no? So now I am rather confused as to what you mean with
> "only
> work when slave is configured with GTID" ?
>

non-gtid mode tests were to make sure replication happens "normally" in
non-gtid mode
even when domain_id filters were set.

But now since this combination (domain_id filters in non-gtid mode) is not
valid any more,
I have removed the non-gtid tests.


>
> > > For non-GTID mode, maybe we need to handle the ignoring in the SQL
> thread
> > > instead? Or alternatively, we can make ignoring events based on
> domain_id
> > > only
> > > legal in GTID mode, and give an error in non-GTID?
> > >
> >
> > I think the patch follows the 2nd approach minus the error. Could you
> > please elaborate
> > on the error?
>
> If domain id filters are not available in non-GTID mode
> (master_use_gtid=no),
> then the server needs to give an error if the user tries to configure
> CHANGE
> MASTER to have domain id filters in non-gtid mode. Otherwise the user
> could be
> unpleasantly surprised.
>

Right, fixed.


>
> > > Why do you need to use ulong? Domain ids should always be uint32, also
> on
> > > 64-bit.
> >
> > > What is the real problem? Eg. why is there a test failure?
> >
> > It was due to the different sizes for ulong on different platforms. While
> > trying to reuse
> > the (modified) auxiliary functions for IGNORE_SERVER_IDS where ulong is
> > used for
> > ids (server_id), uint32 store for domain_id did not work on 64-bit.
>
> Ok, I see, this is to re-use change_master_id_cmp() between server id
> filter
> and domain id filter. For some reason the code for server_id filter is
> using
> ulong, that is what surprised me. Well, I suppose it doesn't matter much
> one
> way or the other.
>

ok.


>
>
> > ##### Case 7: Stop slave into the middle of a transaction being filtered
> and
> > #             start it back with filtering disabled.
> >
> > --echo # On master
> > connection master;
> > SET @@session.gtid_domain_id=1;
> > BEGIN;
> > INSERT INTO t2 VALUES(3);
> > INSERT INTO t3 VALUES(3);
> > sync_slave_with_master;
>
> No, this does not work. Transactions are always binlogged as a whole on the
> master, during COMMIT.
>

You are right. My original intent was to test a transaction which modifies
both MyISAM and
InnoDB tables, where first modification is done in MyISAM table. In which
case the changes
to MyISAM is sent to the slave right away, while rest of trx is sent on
commit. I have modified
the test accordingly.


> So to reliably test the slave I/O thread disconnecting in the middle of an
> event group, you need to use some DBUG error injection. You can see an
> example
> of doing this in rpl_gtid_reconnect.test, though here we may need a
> slightly
> different DBUG injection, as we need to reconfigure the slave in the
> middle of
> the group, not trigger an automatic reconnect.
>
> But wouldn't it just be better to give an error if the user tries to use
> domain id filters with MASTER_USE_GTID=NO ? It could be tricky to implement
> correctly, and it doesn't seem an important use case?
>

I agree.


>
> In GTID mode, things are a bit simpler - because one can not change the
> filters without stopping both SQL and IO thread. And in GTID mode, the I/O
> thread deletes the old relay logs reconnecting after both slave threads
> have
> been stopped, thus there is no opportunity to filter only half of an event
> group when changing domain id filters.
>

Right.

Best,
Nirbhay

Follow ups

References