maria-developers team mailing list archive

Re: GTID and failovers with multi-domain replication

To: Pavel Ivanov <pivanof@xxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Wed, 08 May 2013 09:22:34 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CAAG=WUuG+Gz=VYv_S+A-sWP7_KhbufQgKeL1oEmLqJUJ12hHrQ@mail.gmail.com> (Pavel Ivanov's message of "Tue, 7 May 2013 15:46:49 -0700")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

Pavel Ivanov <pivanof@xxxxxxxxxx> writes:

> assume we have server S1 that is master working with domain_id=0,
> server S2 is master working with domain_id=1, servers S3 and S4 are
> slaves and replicate from both of these masters, i.e. they have both
> domains in their databases. Now let's say S1 has last GTID 0-1-100, S2
> has last GTID 1-2-100. Before S3 and S4 were able to fully catch up
> with S1 and S2 power got cut out from S1 and S2. As replication from
> two masters goes independently it's possible that S3 will have last
> transactions 0-1-100, 1-2-99 while S4 will have last transactions
> 0-1-99, 1-2-100. As my masters are out I want either S3 or S4 to

Right, this will be a common situation.

> connect to S3 because S3 doesn't have 1-2-100. Ideally I'd want for S3
> to replicate from S4 in domain 1 and S4 to replicate from S3 in domain
> 0, and when they are equal in their position I can declare one of them

Yes, this is the idea.

> master for both domains. But it looks like there are no tools to do
> such operation.

Actually, I am implementing this right now and should have something working
next week.

The idea is to have START SLAVE UNTIL master_gtid_pos='xxx'.

To make S3 the new master, we temporarily point S3 to replicate from S4, and
do START SLAVE UNTIL master_gtid_pos='0-1-99,1-2-100'. This will replicate
1-2-100 to S3 and then stop. After this, S3 is strictly ahead of S4, and we
can continue with S3 the master and S4 the slave.

Note that S3 will ask to start at 0-1-100 but stop at 0-1-99. S4 will allow
this because it has the stop position 0-1-99 in the binlog - so there is no
problem that the start position 0-1-100 is missing. This requires support for
START SLAVE UNTIL master_gtid_pos, of course.
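
As a rough illustration of the S3/S4 example above, here is a small Python
model of the GTID positions. It is only a sketch: the helper names
(parse_gtid_pos, replicate_until, is_at_or_ahead) are made up for this
example, and it assumes sequence numbers within a domain can be compared
directly. It does not talk to a real server.

```python
def parse_gtid_pos(text):
    """Parse 'domain-server-seq,...' into {domain: (server_id, seq)}."""
    state = {}
    for gtid in text.split(","):
        domain, server_id, seq = (int(part) for part in gtid.strip().split("-"))
        state[domain] = (server_id, seq)
    return state


def format_gtid_pos(state):
    """Format {domain: (server_id, seq)} back into a GTID position string."""
    return ",".join(f"{d}-{sid}-{seq}" for d, (sid, seq) in sorted(state.items()))


def replicate_until(slave, until):
    """Hypothetical model of START SLAVE UNTIL master_gtid_pos=until.

    In each domain the slave advances to the stop position only if that
    position is ahead of what it already has; otherwise nothing is applied.
    """
    result = dict(slave)
    for domain, (server_id, seq) in until.items():
        current = result.get(domain)
        if current is None or seq > current[1]:
            result[domain] = (server_id, seq)
    return result


def is_at_or_ahead(a, b):
    """True if state a has reached b's sequence number in every domain of b."""
    return all(d in a and a[d][1] >= seq for d, (_sid, seq) in b.items())


s3 = parse_gtid_pos("0-1-100,1-2-99")
s4 = parse_gtid_pos("0-1-99,1-2-100")

# S3 temporarily replicates from S4 and stops at S4's current position.
s3_after = replicate_until(s3, until=s4)
print(format_gtid_pos(s3_after))     # 0-1-100,1-2-100
print(is_at_or_ahead(s3_after, s4))  # True
```

In domain 0 the stop position 0-1-99 is behind S3's start position 0-1-100,
so nothing is applied there; only 1-2-100 is fetched.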

This is the general method to promote S1 as a master among slaves S1, S2, ...,
Sn:

 - Let X be the current GTID state of server S2. Temporarily point S1 to
   replicate from S2 and execute START SLAVE UNTIL master_gtid_pos=X. Then
   execute MASTER_GTID_WAIT(X); when it returns, we know S1 is at least as far
   ahead as S2 in every domain.

 - Repeat with the remaining servers S3, S4, ..., Sn.

 - Now we know S1 is ahead of all other servers, so we can make it the new
   master and point the other slaves to replicate from it.
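
The steps above can be sketched as a loop. This is a simulation only, under
the assumption that sequence numbers within a domain are comparable; the
function promote() and the step strings are hypothetical and just echo the
proposed syntax from this mail, which is not implemented yet.

```python
def parse_gtid_pos(text):
    """Parse 'domain-server-seq,...' into {domain: (server_id, seq)}."""
    state = {}
    for gtid in text.split(","):
        domain, server_id, seq = (int(part) for part in gtid.strip().split("-"))
        state[domain] = (server_id, seq)
    return state


def format_gtid_pos(state):
    """Format {domain: (server_id, seq)} back into a GTID position string."""
    return ",".join(f"{d}-{sid}-{seq}" for d, (sid, seq) in sorted(state.items()))


def promote(candidate_pos, other_positions):
    """Simulate catching the candidate up against every other slave in turn.

    other_positions maps a server name to its current GTID position X.
    Returns the list of steps taken and the candidate's final position.
    """
    state = parse_gtid_pos(candidate_pos)
    steps = []
    for name, pos in other_positions.items():
        steps.append(
            f"point candidate at {name}; "
            f"START SLAVE UNTIL master_gtid_pos='{pos}'; "
            f"MASTER_GTID_WAIT('{pos}')"
        )
        # Model the effect: advance only in domains where the peer is ahead.
        for domain, (server_id, seq) in parse_gtid_pos(pos).items():
            if domain not in state or seq > state[domain][1]:
                state[domain] = (server_id, seq)
    return steps, format_gtid_pos(state)


# Made-up positions for three slaves; S1 is the promotion candidate.
steps, final = promote(
    "0-1-98,1-2-100",
    {"S2": "0-1-100,1-2-97", "S3": "0-1-99,1-2-99"},
)
for step in steps:
    print(step)
print(final)  # 0-1-100,1-2-100
```

After the loop the candidate holds the highest sequence number seen in each
domain, so the other slaves can be pointed at it.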

START SLAVE UNTIL master_gtid_pos is not available in the current code, but I
am implementing it now (and after that MASTER_GTID_WAIT()).

----

There is actually another possible answer, related to strict mode. In strict
mode, sequence numbers are always increasing. So it is safe to allow a slave
to connect to a master starting at a GTID not (yet) present in the master
binlog. If there really is a hole, we will give the error as soon as the hole
is reached (as we discussed in the previous mail).

So if we implement this, one could just connect S3 to S4 (and get no error),
wait for it to catch up, then make S3 master.

I am not sure it is a good idea to allow connecting at a future GTID in
strict mode. It does seem to go a bit against the idea of "strict"; on the
other hand, the error is still caught later.

The main reason for giving the error in non-strict mode is to avoid the case
where a slave asks for 0-1-3 in [0-1-1 0-1-2 0-2-1 0-2-2 0-2-3 ...] and ends
up silently doing nothing, endlessly skipping server_id=2 events while waiting
for a 0-1-3 that never shows up. This problem does not occur in strict mode,
as it enforces monotonic sequence numbers.
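
A small Python sketch of the difference, with the binlog modelled as a list
of (domain, server_id, seq) tuples. The function names are invented for this
illustration, and the strict-mode rule used here (error on the first event in
the domain whose sequence number reaches or passes the requested one without
being the requested GTID) is my assumption about how the hole would be
detected, not a description of the server code.

```python
def find_start_nonstrict(binlog, wanted):
    """Scan for the exact GTID; return the index after it, or None.

    Without monotonic sequence numbers, a missing GTID cannot be told apart
    from one that has not been written yet, so the scan just never matches.
    """
    for i, gtid in enumerate(binlog):
        if gtid == wanted:
            return i + 1
    return None


def find_start_strict(binlog, wanted):
    """Assumed strict-mode rule: sequence numbers increase within a domain.

    Returns the index to start sending from. If the requested GTID is in the
    future, that is the end of the binlog. Raises LookupError on a hole.
    """
    w_domain, _w_server, w_seq = wanted
    for i, (domain, server_id, seq) in enumerate(binlog):
        if domain != w_domain:
            continue
        if (domain, server_id, seq) == wanted:
            return i + 1
        if seq >= w_seq:
            raise LookupError(f"hole: reached seq {seq} without seeing {wanted}")
    return len(binlog)


nonstrict_binlog = [(0, 1, 1), (0, 1, 2), (0, 2, 1), (0, 2, 2), (0, 2, 3)]
print(find_start_nonstrict(nonstrict_binlog, (0, 1, 3)))  # None

future_binlog = [(0, 1, 98), (0, 1, 99)]
print(find_start_strict(future_binlog, (0, 1, 100)))  # 2

hole_binlog = [(0, 1, 1), (0, 1, 2), (0, 2, 4)]
try:
    find_start_strict(hole_binlog, (0, 1, 3))
except LookupError as err:
    print(err)
```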

 - Kristian.

Follow ups

Re: GTID and failovers with multi-domain replication
From: Pavel Ivanov, 2013-05-08

References

GTID and failovers with multi-domain replication
From: Pavel Ivanov, 2013-05-07