maria-discuss team mailing list archive

Thread
Date

Re: Fix different gtid-positions on domain 0 in multi-master

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Reinder Cuperus <reinder@xxxxxxxxxxxxxx>
Date: Fri, 17 Mar 2017 11:50:26 +0100
Cc: maria-discuss@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87k27o5pe4.fsf@urd.knielsen-hq.org>
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Mar 17, 2017 at 08:33:07AM +0100, Kristian Nielsen wrote:
> Reinder Cuperus <reinder@xxxxxxxxxxxxxx> writes:
> 
> > The problem is, as soon as I stop that connection, that master2 and
> > master3 have different gtid-positions for domain0, and stop/start on
> > replication master3->backup results in the error:
> > "Got fatal error 1236 from master when reading data from binary log:
> > 'Error: connecting slave requested to start from GTID 0-1-3898746614,
> > which is not in the master's binlog'
> 
> Yes. backup sees that it is ahead of master3 in domain 0, so it aborts to
> avoid risk of diverging replication.
> 
> > I have tried moving master1/2 to domain_id:1, and removing the
> > domain_id:0 from the gtid_slave_pos on backup, but starting the
> > replication master2->backup results in the error:
> > Got fatal error 1236 from master when reading data from binary log:
> > 'Could not find GTID state requested by slave in any binlog files.
> > Probably the slave state is too old and required binlog files have been
> > purged.'
> 
> Yes. Because backup now sees that it is far behind in domain 0 (as it sees
> the world), and aborts to not silently lose transactions.
> 
> > I tried finding a way to purge domain:0 from master3/master4, but the
> > only way sofar I have found is doing a "RESET MASTER" on master3, which
> > would break replication between master3 and master4.
> 
> Yes, I guess this is what you need. You have made a copy and removed half of
> the data, and now you need to similarly remove half of the binlog. Even if
> there are no actual transactions left from a domain in non-purged binlogs,
> the binlogs still remember the history of all domains, in order to not
> silently lose transactions for a slave that gets far behind.
> 
> It would be useful in general to be able to purge a domain from a binlog.
> But currently the only way I can think of is RESET MASTER.
> 
> You can see how this binlog history looks by checking @@gtid_binlog_state,
> and in the GTID_LIST events at the head of each binlog file.

Thank you for this insightful information.

> > I have tried to find a way to insert an empty transaction, with the last
> > gtid on domain_id:0 on the master3, to bring master2/master3 in sync
> > again on that domain, but I could not find a way to do that on MariaDB.
> 
> The server will not binlog an empty transaction, but a dummy transaction
> should work, eg. create and drop a dummy table, for example:
> 
>   CREATE TABLE dummy_table (a INT PRIMARY KEY);
>   SET gtid_domain_id= 0;
>   SET gtid_server_id= 1;
>   SET gtid_seq_no= 3898746614;
>   DROP TABLE dummy_table;
> 
> Maybe this way you can make the binlogs look like they are in sync to the
> replication, not sure. It might be tricky, but then you do seem to have a
> good grasp of the various issues involved.

I have created a test-setup with Vagrant, and will use it to test this method.
Sofar this seems to be the sollution with the last impact. If at a later point
MariaDB gets the option of purging a domain, I'll use that to remove this
unused domain from all the machines.

> > Are there other ways to fix this issue, so I can have reliable
> > replication master3->backup without having to keep the dummy replication
> > backup->master3 indefinitely?
> 
> I guess you would need to stop traffic to master3/master4 while getting them
> in sync with one another and the do RESET MASTER on both and SET GLOBAL
> gtid_slave_pos="" to start replication from scratch. You would then also
> need to have server 'backup' up-to-date with master3 before RESET MASTER,
> and remove domain id 20 from the gtid_slave_pos on backup after the RESET
> MASTER.
> 
> So that is quite intrusive.

Stopping traffic to master3/4 might indeed be quite disruptive. I could imagine
the following sequence to be working in that case, assuming master3 is primary
in keepalived:
- master3> FLUSH TABLES WITH READ LOCK; STOP SLAVE;
- master4> wait till slave is quiet --> STOP SLAVE;
- master3> RESET MASTER; SET GLOBAL gtid_slave_pos="";
- master4> RESET MASTER; SET GLOBAL gtid_slave_pos="";
- master3> UNLOCK TABLES;
- master4> START SLAVE;
- master3> START SLAVE;

Properly scripted, and run at a moment when there are no long transactions and
no replication-lag, the impact would not be noticeable to the end-users. But
failure would mean broken replication, and a need to rebuild master4 causing a
temporary loss of redundancy.


Thank you for the information.

Regards,
Reinder Cuperus

References

Fix different gtid-positions on domain 0 in multi-master
From: Reinder Cuperus, 2017-03-16
Re: Fix different gtid-positions on domain 0 in multi-master
From: Kristian Nielsen, 2017-03-17