maria-discuss team mailing list archive

Re: Long email about a replication issue

 

<Rhys.Campbell@xxxxxxxxxxxx> writes:

> This weekend I had to repair a replication problem on one of our
> clusters. I've attempted to get to the root cause but not sure where I

Is this pure MariaDB replication, or is it Galera? I think it is the former,
but the term "cluster" is somewhat overloaded, which is why I ask...

> The setup is M1<->M2 (with attached slaves). M1 is the active master
> receiving all writes. Access is controlled through an F5 and I don't
> think any errant transactions have occurred on the inactive master
> (M2). I've checked this by grepping the binlogs for the M2 server_id.
>
> The initial associated record that broke replication was attached to a
> "user" table record. This user was created on Friday at
> 16:21PM. Replication broke around 11:30PM that night. The user record
> had a GTID of GTID 0-1-36823254 (recovered from M1)
>
> I've looked into the appropriate binlog from M2...

> If I grep for the specific GTID on M2 I get nothing...

> If I grep for this record by email address I also get nothing. So I
> must conclude this record (and a bunch of others), never got to master

> until replication broke due to the FK errors. You would expect
> replication to break here because of a gap in the GTIDs. This did not
> happen and I'm almost certain that GTID replication could not have
> been deactivated and the positions messed around with.

Yeah, even if the slave was set to MASTER_USE_GTID=no, the GTIDs should
still have been there in the M2 binlog.

> I'm unsure of where to go now. Any ideas? Any thoughts are appreciated.

I guess you need to figure out why M2 did not apply those transactions. Some
suggestions:

 - Check the error log on M2 for disconnects/reconnects around the time of
   the missing transactions (or any disconnects/reconnects at all). Such
   messages should also say at what position M2 disconnected and
   reconnected, which can be compared to the problem GTID. This could show
   whether transactions were skipped because of reconnecting at a wrong
   position.

 - Also check for local slave stop/start messages in the M2 error log, to see
   if anything looks related or could indicate changes in the replication
   config (most replication changes require stopping the slave threads).

 - You can also check the binlog on M1 for any out-of-order GTIDs, which
   could cause problems at slave reconnect (seems unlikely though).

 - Replication filtering could cause this - double-check that no filtering
   was turned on. Also check options like --gtid-ignore-duplicates.
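
For the out-of-order check, something like the rough Python sketch below
might help. It assumes the text output of mysqlbinlog carries the GTID in
each event header as "GTID domain-server-seq" (for example
"GTID 0-1-36823254"); check this against your own output first. The function
name scan_gtids is made up for illustration. It reports sequence numbers
that go backwards within a domain, and also jumps in the sequence, which on
M2 would be a hint (not proof) of where transactions were skipped.

```python
import re

# Assumed header format in mysqlbinlog text output: "... GTID 0-1-36823254"
GTID_RE = re.compile(r"\bGTID (\d+)-(\d+)-(\d+)\b")


def scan_gtids(lines):
    """Scan binlog dump lines; return (out_of_order, gaps) per domain.

    out_of_order: (domain, server_id, seq, highest_seq_seen_before)
    gaps:         (domain, previous_seq, next_seq)
    """
    highest = {}        # domain id -> highest sequence number seen so far
    out_of_order = []
    gaps = []
    for line in lines:
        match = GTID_RE.search(line)
        if not match:
            continue
        domain, server_id, seq = (int(g) for g in match.groups())
        prev = highest.get(domain)
        if prev is not None and seq <= prev:
            # Sequence number went backwards (or repeated) in this domain.
            out_of_order.append((domain, server_id, seq, prev))
            continue
        if prev is not None and seq > prev + 1:
            # Jump in the sequence: only a hint that events are missing.
            gaps.append((domain, prev, seq))
        highest[domain] = seq
    return out_of_order, gaps
```

You would feed it the lines of a dump, e.g. scan_gtids(open("m2-binlog.txt"))
where m2-binlog.txt is a hypothetical file holding the mysqlbinlog output of
the binlog in question. Running it on both the M1 and the M2 dump and
comparing where the gaps start should narrow down the time window to look at
in the error log.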

Good luck,

 - Kristian.

