maria-developers team mailing list archive

Re: [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

 

Simon, thanks for your detailed answer.

I see your point about having access to powerful tools when they are needed,
even when such tools can be dangerous if used incorrectly. It reminds me
of the old "goto considered harmful", which I never agreed with.

It occurs to me that there are actually two distinct features implicitly
being discussed here.

One is: Forget this domain_id, it has been unused since forever, but do
check that it actually _is_ unused to avoid mistakes (the check would be
that the domain is already absent from all available binlog files). This is
the one I originally had in mind.

Another is: There is this domain_id in the old binlog files, it is causing
problems, we need to recover and we know what we are doing. I think this is
the one you have in mind in what you write, and it seems very valid as well.

It helped me to think of them explicitly as two distinct features.
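
To make the distinction concrete, here is a rough sketch in Python. It is
only a toy model, not server code: the function names, the file names and
the allow_active flag are all made up for illustration. The first feature
refuses to forget a domain that still appears in any available binlog file;
the second skips that check on request.

```python
from typing import Mapping, Set


def domain_is_unused(binlog_domains: Mapping[str, Set[int]], domain_id: int) -> bool:
    """Return True when no available binlog file contains the domain.

    binlog_domains maps a (hypothetical) binlog file name to the set of
    domain_ids that have events in that file.
    """
    return all(domain_id not in domains for domains in binlog_domains.values())


def delete_domain(
    binlog_domains: Mapping[str, Set[int]],
    known_domains: Set[int],
    domain_id: int,
    allow_active: bool = False,
) -> Set[int]:
    """Return the set of domains the server still claims to know about.

    Feature one (default): refuse unless the domain is absent from all files.
    Feature two (allow_active=True): forget the domain regardless.
    """
    if not allow_active and not domain_is_unused(binlog_domains, domain_id):
        raise ValueError(f"domain {domain_id} is still present in the binlog files")
    return set(known_domains) - {domain_id}


if __name__ == "__main__":
    binlogs = {"bin.000001": {0, 10}, "bin.000002": {0}}
    print(delete_domain(binlogs, {0, 7, 10}, 7))                      # checked delete
    print(delete_domain(binlogs, {0, 7, 10}, 10, allow_active=True))  # forced delete
```

The point of keeping the check as the default is that the dangerous variant
has to be asked for explicitly.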

Also, Andrei's suggestion to fix IGNORE_DOMAIN_IDS so that a slave can
connect to a master that has that domain and completely ignore it seems
useful for some of the scenarios you mention.

> Imagine a replication chain of M[aster] —> S[lave]1, S[lave]2, A[ggregate]1
> and A[ggregate]1 —> A[ggregate]2 , A[ggregate]3, ….

> If M dies and say A1 happens to be more up to date than S1, S2 then we may want to promote
> A1 to be the new master, and move S1, S2 under A1, move A2 under A1
> (but promote as the aggregate writeable master),
> and move A3 under A2. This would not be the “desired” setup as probably we’d end
> up throwing away all the aggregate data on A1.

Right, I see. Throwing away table data needs matching editing of the binlog
history to give a consistent replication state. And indeed in a failover
scenario, waiting for logs to be purged/purgeable does not seem appropriate.

> In this specific case it may be you really do want to hide the 2 sets
> of domains and only show one
> to the S1, S2 boxes, but maintain 2 domains on A2, A3.

Agree. So a fixed IGNORE_DOMAIN_IDS would seem helpful here.

> It depends but in my opinion in most cases letting replication flow is more
> important than having 100% master and slave consistency. The longer the
> slave is stopped the more differences there are.
>
> And when you get in a situation like this you’re very tempted to go back to
> binlog file plus position, to scan the bin logs with tools like mysqlbinlog
> and do it the old way like we used to do years ago.  This is tedious and error
> prone but if you’re careful it works fine. The whole idea of GTID is to avoid
> the DBA ever having to do this…

Right. Though once multiple domains are involved, the binlog is effectively
multiple streams, and using the old-style single file/offset position may be
tricky.

But if IGNORE_DOMAIN_IDS works for master connection as well, then the slave
has the ability to say exactly which domains it wants to see, and exactly
where in each of those domains it wants to start (gtid_slave_pos), so that
should be quite flexible.
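
As a rough illustration of that flexibility, here is a toy model in Python.
It is a sketch under simplifying assumptions, not how the server does it: the
function names are invented, and the start point in each domain is found by
comparing sequence numbers, which ignores out-of-order sequence numbers. A
domain missing from the position is assumed to start from the beginning.
GTIDs are written as domain-server-seq, as in gtid_slave_pos.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Gtid = Tuple[int, int, int]  # (domain_id, server_id, seq_no)


def parse_gtid_pos(pos: str) -> Dict[int, int]:
    """Parse a 'D-S-N,D-S-N' position string into {domain_id: seq_no}."""
    result: Dict[int, int] = {}
    for item in pos.split(","):
        item = item.strip()
        if not item:
            continue
        domain, _server, seq = (int(part) for part in item.split("-"))
        result[domain] = seq
    return result


def events_for_slave(
    binlog: Iterable[Gtid], slave_pos: str, ignore_domains: Set[int]
) -> Iterator[Gtid]:
    """Yield the events a slave would receive in this simplified model.

    Events in ignored domains are dropped entirely; in every other domain
    only events after the slave's position in that domain are sent.
    """
    start = parse_gtid_pos(slave_pos)
    for domain, server, seq in binlog:
        if domain in ignore_domains:
            continue
        if seq <= start.get(domain, 0):
            continue
        yield (domain, server, seq)


if __name__ == "__main__":
    binlog = [(0, 1, 1), (10, 2, 1), (0, 1, 2), (1, 3, 5), (0, 1, 3)]
    for gtid in events_for_slave(binlog, "0-1-1,1-3-5", {10}):
        print("-".join(str(part) for part in gtid))
```

Each domain is handled as its own stream, which is exactly what a single
file/offset position cannot express.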

When I designed GTID I actually had this very much in mind: GTID should be
a full replacement for the old style of replication, and should allow doing
whatever is needed to solve the problem at hand. For example, this is why the
code tries so hard to deal with out-of-order GTID sequence numbers (as
opposed to just refusing to ever operate with those).

On the other hand, it was also a goal to be much more consistent and strict,
and to try to prevent silent failures and inconsistencies. These two goals
tend to conflict in some areas. Hence, for example, gtid_strict_mode.

There are still a few features that were never implemented but should have
been (like DELETE DOMAIN and binlog indexes for example), and it is surely
not perfect.

> So I see the DELETE DOMAIN (MariaDB) or “remove old UUID” (MySQL) type request
> to be one that means the master will only pretend that it can serve or knows about
> the remaining domains or UUIDs and if the slaves are sufficiently up to date they
> really don’t care as their vision will be similar.  Such a command would be replicated,
> right? It has to be for the slaves to change “their view” at the same moment
> in replication (not necessarily time) as the master.

Hm, good point about whether it will be replicated.

FLUSH LOGS is replicated by default with an option not to, so a DELETE
DOMAIN would be also, I suppose. This makes it seem even more dangerous,
frankly. Imagine an active domain being deleted by mistake: now the mistake
immediately propagates to all servers in the replication topology, ouch.

Maybe there should be an option, for example

  FLUSH BINARY LOGS DELETE DOMAIN 10 NOCHECK

or

  FLUSH BINARY LOGS DELETE DOMAIN 10 ALLOW ACTIVE

or something.

Note that the effect of deleting a domain is basically to add at the head of
the binlog a mark that says the domain never existed. All of the old binlog
is unchanged. So the command does not really immediately affect running
replication, only new slave re-connections.

Hope this helps,

 - Kristian.

