maria-developers team mailing list archive

Re: Review of patch for MDEV-4820

 

Kristian,

I'm sorry for reviving this old thread, but I think it still doesn't
work correctly. I took the latest 10.0-base (rev 3690) and started
to simulate different situations where a slave is restored from a backup
that is too old and thus can't replicate from the master. I've set up
servers S1 (server_id = 1) and S2 (server_id = 2). In all tests I
make S1 the master and S2 the slave, and I execute CHANGE MASTER TO ...
MASTER_USE_GTID = current_pos.

1. Set gtid_binlog_state and gtid_slave_pos to '0-1-10' on S1 and to
'0-1-1' on S2. Try to start slave on S2. I get the correct error
"Probably the slave state is too old".

2. Execute 3 transactions on S1, so its gtid_current_pos is 0-1-13. Start
slave on S2 (after CHANGE MASTER); it shows the correct error "slave state
is too old" again.

3. Set gtid_binlog_state and gtid_slave_pos to '0-3-10' on S1 and to
'0-1-1' on S2. Try to start slave on S2. Now I get error "slave has
diverged". What gives? It's not diverged, it's just behind.

4. Now execute a couple of transactions on S1; its gtid_current_pos is
now 0-1-12. Start slave on S2 (remember -- its gtid_current_pos is
0-1-1). And now I see even more confusing "The binlog on the master is
missing the GTID 0-1-1 requested by the slave (even though both a
prior and a subsequent sequence number does exist)". I'm sorry, which
prior sequence number exists?

Do you think you can fix these problems?


Pavel


On Sat, Aug 24, 2013 at 10:25 PM, Pavel Ivanov <pivanof@xxxxxxxxxx> wrote:
> Alright. I'd say if this is the only meaning current_pos should have
> then the name "current" is somewhat misleading.
> But ok, I'll set both gtid_binlog_state and gtid_slave_pos. It seems
> to be working so far.
>
> Pavel
>
> On Sat, Aug 24, 2013 at 1:00 AM, Kristian Nielsen
> <knielsen@xxxxxxxxxxxxxxx> wrote:
>> Pavel Ivanov <pivanof@xxxxxxxxxx> writes:
>>
>>> I took 10.0-base r3685 and started a new, just-bootstrapped server with
>>> server_id = 1. It has @@global.gtid_binlog_pos,
>>> @@global.gtid_slave_pos and @@global.gtid_current_pos empty. Then I
>>> execute
>>>
>>> set global gtid_binlog_state = '0-10-10'
>>>
>>> After that @@global.gtid_binlog_pos = '0-10-10' as expected, but both
>>> @@global.gtid_slave_pos and @@global.gtid_current_pos are still empty.
>>> Because of that, the server won't be able to replicate from the master.
>>> If I set gtid_binlog_state to '0-1-10', though,
>>> @@global.gtid_current_pos changes to '0-1-10' and everything is fine.
>>
>> The short answer is that you should just set both gtid_slave_pos and
>> gtid_binlog_state on the new server.
>>
>>   SET GLOBAL gtid_binlog_state = '0-10-10';
>>   SET GLOBAL gtid_slave_pos = @@GLOBAL.gtid_binlog_pos;
>>
>> For the longer answer, let me try to explain:
>>
>> The gtid_binlog_pos and the gtid_slave_pos are different concepts in
>> MariaDB. The former is the last GTID logged into the binlog (for each
>> domain). The latter is the last GTID replicated by the slave.
>>
>> These become different because on the one hand slave can use
>> --log-slave-updates=0 (so binlog is not updated), and on the other hand I did
>> not want to add overhead of updating gtid_slave_pos for every transaction on
>> the master. So a GTID that goes into one of them may or may not go into the
>> other.
>>
>> Now let us set up a slave with
>>
>>     CHANGE MASTER TO master_host= ... , master_use_gtid=slave_pos;
>>
>> The slave starts replication at the value of gtid_slave_pos. Every replicated
>> GTID updates gtid_slave_pos, so to switch master we can just point it to the
>> new host and it will continue from the correct point.
>>
>> But suppose we promote a new master, and later want the old master to
>> become a slave. The old master did not update gtid_slave_pos, so the point at
>> which to start is the last GTID logged to the binlog, gtid_binlog_pos. Thus to
>> start the old master replicating as a slave one should use:
>>
>>     SET GLOBAL gtid_slave_pos = @@GLOBAL.gtid_binlog_pos;
>>     CHANGE MASTER TO master_host= ... , master_use_gtid=slave_pos;
>>
>> and then things will proceed correctly with the new slave server.
>>
>> So this is how you should think of the variables. The gtid_slave_pos is the
>> position at which to start replication for a slave. The gtid_binlog_pos is the
>> last GTID logged into the binlog.
>>
>> Now, this creates an asymmetry - to switch a server to replicate from a new
>> master, the user has to know if the server was a master or a slave before, and
>> do it differently depending on which it is.
>>
>> So I wanted to provide a way to avoid this asymmetry, and I implemented CHANGE
>> MASTER TO master_use_gtid=current_pos for this. In this mode, when the slave
>> connects, it looks into both the gtid_slave_pos and the gtid_binlog_pos to
>> decide which of these has the most recent GTID - and then uses that GTID as
>> the point to start replication at.
>>
>> If server was a master before, then the last GTID in the binlog will have the
>> server's own server_id; _and_ the sequence number will be bigger than what is
>> in the gtid_slave_pos because sequence numbers on a master are always
>> generated bigger than any seen before. So in this case we use the last GTID in
>> the binlog to connect to. Otherwise we use the gtid_slave_pos.
>>
>> So that is _all_ that gtid_current_pos is - it is a way for the server to
>> guess whether it was a master or a slave before, and act accordingly. A bit of
>> magic for casual users that do not want to be aware of whether the server they
>> are setting up as a slave was a slave already before, or a master.
>>
>> So the point is that if you want to use gtid_current_pos on a newly set up
>> server, you need to provide correct values for _both_
>> gtid_binlog_pos/gtid_binlog_state _and_ gtid_slave_pos. Because
>> gtid_current_pos is the result of combining the two.
>>
>>> It looks like the problem is in the server_id check in the first loop
>>> in rpl_slave_state::iterate(). Can it be removed from there?
>>
>> I think so - in strict mode, the most recent GTID will always be the one with
>> the highest sequence number, so the server_id check is not needed. On the
>> other hand, if things are done correctly, the server_id check will make no
>> difference, as a GTID with different server_id cannot get into the binlog
>> without also getting into gtid_slave_pos.
>>
>> But for now I have other, more critical things I want to fix first - I think
>> this is not a critical thing, just setting gtid_slave_pos on the new server
>> should make things work for you? (else let me know if I missed something).
>>
>>  - Kristian.
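
The selection rule Kristian describes above for master_use_gtid=current_pos
can be sketched roughly as follows. This is a simplified Python model, not
the server code: the function names (parse_gtid_list, current_pos) are
hypothetical, it assumes one GTID per domain in each input, and it assumes
the rule is exactly "prefer the binlog GTID only when it carries the
server's own server_id and a higher sequence number". It reproduces the
behavior reported in the quoted message, where '0-10-10' in the binlog on a
server with server_id = 1 leaves gtid_current_pos empty but '0-1-10' does not.

```python
def parse_gtid_list(text):
    """Parse 'domain-server_id-seq_no[,...]' into {domain: (server_id, seq_no)}."""
    result = {}
    for item in filter(None, (part.strip() for part in text.split(","))):
        domain, server_id, seq_no = (int(field) for field in item.split("-"))
        result[domain] = (server_id, seq_no)
    return result


def current_pos(own_server_id, binlog_pos, slave_pos):
    """Model of gtid_current_pos as described in this thread (an assumption).

    Start from gtid_slave_pos. For each domain, switch to the binlog GTID
    only if it was generated by this server (same server_id) and its
    sequence number is higher than the one in gtid_slave_pos.
    """
    pos = parse_gtid_list(slave_pos)
    for domain, (server_id, seq_no) in parse_gtid_list(binlog_pos).items():
        if server_id != own_server_id:
            continue  # the server_id check discussed in the thread
        if domain not in pos or seq_no > pos[domain][1]:
            pos[domain] = (server_id, seq_no)
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(pos.items()))


if __name__ == "__main__":
    # Binlog GTID from a foreign server_id is ignored, so the result is empty.
    print(repr(current_pos(1, "0-10-10", "")))
    # Binlog GTID with the server's own server_id is used.
    print(repr(current_pos(1, "0-1-10", "")))
```

Under this model, setting only gtid_binlog_state with a foreign server_id
gives the slave no usable start position, which is why both
gtid_binlog_state and gtid_slave_pos need to be set on a newly restored
server.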

