maria-developers team mailing list archive

Re: GTID and failovers with multi-domain replication

I love the START SLAVE UNTIL idea. It looks very clean and reasonable.
I don't like the idea of giving this case special treatment in strict
mode.

> Not sure if it is a good idea to allow connect at a future GTID in strict
> mode. It does seem to go a bit against the idea with "strict", on the other
> hand the error is still caught later.

This is exactly the problem. Strict mode should be about strict
discipline on the DBA's side. If he connects S3 to replicate from S4
just to catch up, and he intends to make S3 the master later, then he
must say so explicitly by issuing START SLAVE UNTIL. If he issues a
regular START SLAVE, that may mean he really wants S4 to be the master
and doesn't intend to switch later. And then he will be surprised when
replication doesn't progress.

Pavel

On Wed, May 8, 2013 at 12:22 AM, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
> Pavel Ivanov <pivanof@xxxxxxxxxx> writes:
>
>> Assume we have server S1, a master working with domain_id=0, and
>> server S2, a master working with domain_id=1. Servers S3 and S4 are
>> slaves and replicate from both of these masters, i.e. they have both
>> domains in their databases. Now let's say S1 has last GTID 0-1-100 and
>> S2 has last GTID 1-2-100. Before S3 and S4 were able to fully catch up
>> with S1 and S2, both masters lost power. Since replication from the
>> two masters proceeds independently, it's possible that S3 will have
>> last transactions 0-1-100, 1-2-99 while S4 will have last transactions
>> 0-1-99, 1-2-100. As my masters are out I want either S3 or S4 to
>
> Right, this will be a common situation.
>
>> connect to S3 because S3 doesn't have 1-2-100. Ideally I'd want for S3
>> to replicate from S4 in domain 1 and S4 to replicate from S3 in domain
>> 0, and when they are equal in their position I can declare one of them
>
> Yes, this is the idea.
>
>> master for both domains. But it looks like there are no tools to do
>> such operation.
>
> Actually, I am implementing this right now, should have something working next
> week.
>
> The idea is to have START SLAVE UNTIL master_gtid_pos='xxx'.
>
> To make S3 the new master, we temporarily point S3 to replicate from S4 and
> do START SLAVE UNTIL master_gtid_pos='0-1-99,1-2-100'. This will replicate
> 1-2-100 to S3 and then stop. After this, S3 is strictly ahead of S4, and we
> can continue with S3 as the master and S4 as the slave.
>
> Note that S3 will ask to start at 0-1-100 but stop at 0-1-99. S4 will allow
> this because it has the stop position 0-1-99 in the binlog - so there is no
> problem that the start position 0-1-100 is missing. This requires support for
> START SLAVE UNTIL master_gtid_pos, of course.
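
A small sketch of the comparison behind the example above (added for
illustration, not part of the original mail). It parses the two GTID
positions and reports, per domain, where one server is behind the other.
The helper names parse_gtid_pos and domains_behind are hypothetical, and
comparing sequence numbers this way assumes they only increase within a
domain.

```python
def parse_gtid_pos(text):
    """Parse '0-1-100,1-2-99' into {domain_id: (server_id, seq_no)}."""
    pos = {}
    for item in text.split(","):
        domain, server, seq = (int(part) for part in item.strip().split("-"))
        pos[domain] = (server, seq)
    return pos


def domains_behind(mine, other):
    """Return the domains in which `other` holds a higher seq_no than `mine`.

    Assumption: within a domain, a higher seq_no means a later transaction.
    """
    return sorted(
        domain
        for domain, (_, seq) in other.items()
        if domain not in mine or mine[domain][1] < seq
    )


s3 = parse_gtid_pos("0-1-100,1-2-99")
s4 = parse_gtid_pos("0-1-99,1-2-100")

print(domains_behind(s3, s4))  # [1]: S3 still needs 1-2-100 from S4
print(domains_behind(s4, s3))  # [0]: S4 still needs 0-1-100 from S3
```

Neither server is ahead in every domain, which is why the stop position
has to be S4's full state ('0-1-99,1-2-100') rather than a single GTID.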
>
> This is the general method to promote S1 as a master among slaves S1, S2, ...,
> Sn:
>
>  - Let X be the current GTID state of server S2. Temporarily point S1 to
>    replicate from S2 and execute START SLAVE UNTIL master_gtid_pos=X. Then
>    execute MASTER_GTID_WAIT(X); when this returns, we know S1 is strictly
>    ahead of S2.
>
>  - Repeat with the remaining servers S3, S4, ..., Sn.
>
>  - Now we know S1 is ahead of all other servers, so we can make it the new
>    master and point the other slaves to replicate from it.
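
The steps above can be modelled roughly as follows (an illustrative
sketch, not code from the thread). It only builds the statement text
named in the mail and simulates the resulting state; it does not talk to
a server. The functions format_pos and promote are hypothetical, the
"point ... at ..." step is left as a comment because the mail does not
give its syntax, and calling MASTER_GTID_WAIT through SELECT is an
assumption, since the function was still unimplemented at the time.

```python
def format_pos(state):
    """Render {domain_id: (server_id, seq_no)} as '0-1-99,1-2-100'."""
    return ",".join(
        f"{domain}-{server}-{seq}"
        for domain, (server, seq) in sorted(state.items())
    )


def promote(states, candidate):
    """Return (statements, final_state) for promoting `candidate`.

    `states` maps a server name to its GTID state. The catch-up is
    modelled as taking the higher seq_no per domain, which assumes
    sequence numbers only increase within a domain.
    """
    current = dict(states[candidate])
    statements = []
    for name in sorted(states):
        if name == candidate:
            continue
        x = format_pos(states[name])
        statements.append(f"-- point {candidate} at {name}, then:")
        statements.append(f"START SLAVE UNTIL master_gtid_pos='{x}'")
        statements.append(f"SELECT MASTER_GTID_WAIT('{x}')")
        # Simulate replicating everything up to X from this server.
        for domain, (server, seq) in states[name].items():
            if domain not in current or current[domain][1] < seq:
                current[domain] = (server, seq)
    return statements, current


states = {
    "S3": {0: (1, 100), 1: (2, 99)},
    "S4": {0: (1, 99), 1: (2, 100)},
}
statements, final = promote(states, "S3")
print("\n".join(statements))
print(format_pos(final))  # 0-1-100,1-2-100
```

With the S3/S4 states from the earlier example, the generated stop
position is '0-1-99,1-2-100', the same one used in the mail.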
>
> START SLAVE UNTIL master_gtid_pos is not available in the current code, but I
> am implementing it now (and after that MASTER_GTID_WAIT()).
>
> ----
>
> There is actually another possible answer, related to strict mode. In strict
> mode, sequence numbers are always increasing. So it is safe to allow a slave
> to connect to a master starting at a GTID not (yet) present in the master
> binlog. If there really is a hole, we will give the error as soon as the hole
> is reached (as we discussed in the previous mail).
>
> So if we implement this, one could just connect S3 to S4 (and get no error),
> wait for it to catch up, then make S3 master.
>
> Not sure if it is a good idea to allow connect at a future GTID in strict
> mode. It does seem to go a bit against the idea with "strict", on the other
> hand the error is still caught later.
>
> The main reason for giving the error in non-strict mode is to avoid the case
> where a slave asks for 0-1-3 in [0-1-1 0-1-2 0-2-1 0-2-2 0-2-3 ...] and ends up
> silently doing nothing, endlessly skipping server_id=2 events while waiting for
> a 0-1-3 that never shows up. This problem does not occur in strict mode, as it
> enforces monotonic sequence numbers.
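
To make the binlog example above concrete, here is a short sketch (added
for illustration, not the server's actual logic). It checks whether a
binlog satisfies the strict-mode property described in the mail, that
sequence numbers increase within a domain, and whether a requested GTID
is present at all. The function names are hypothetical, and the second
binlog is invented test data.

```python
def parse_gtid(text):
    """Split 'domain-server-seq' into a tuple of three integers."""
    domain, server, seq = (int(part) for part in text.split("-"))
    return domain, server, seq


def is_strictly_increasing(binlog):
    """True if seq_no increases within every domain of the binlog."""
    last_seq = {}
    for gtid in binlog:
        domain, _, seq = parse_gtid(gtid)
        if domain in last_seq and seq <= last_seq[domain]:
            return False
        last_seq[domain] = seq
    return True


def exact_match_exists(binlog, wanted):
    """True if the requested GTID appears in the binlog."""
    return wanted in binlog


# The binlog from the mail: server_id=2 restarted numbering at 1.
non_strict = "0-1-1 0-1-2 0-2-1 0-2-2 0-2-3".split()
print(is_strictly_increasing(non_strict))       # False
print(exact_match_exists(non_strict, "0-1-3"))  # False

# Invented data: a binlog that does keep seq_no increasing.
strict = "0-1-1 0-1-2 0-2-3 0-2-4".split()
print(is_strictly_increasing(strict))           # True
```

In the first binlog, 0-1-3 is absent and sequence numbers cannot tell a
"future" GTID apart from one that will never arrive, which is the case
the error is meant to catch.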
>
>  - Kristian.