maria-developers team mailing list archive

Re: GTID replication and relay_log.info

Pavel Ivanov <pivanof@xxxxxxxxxx> writes:

> See logs in attachment. It looks clear to me that relay logs get replayed
> after crash and don't get deleted. I'm not sure though if the reason of
> that can be seen in the logs.

No, I agree it's not clear from the logs one way or the other, so good that
you mention this.

> Can you point me to the code where relay logs are deleted? I'll try to
> check why it's not called...

It is this, in slave.cc start_slave_threads():

  if (mi->using_gtid != Master_info::USE_GTID_NO &&
      !mi->slave_running && !mi->rli.slave_running)
  {
    purge_relay_logs(&mi->rli, NULL, 0, &errmsg);
    mi->master_log_name[0]= 0;
    mi->master_log_pos= 0;
  }

When both slave IO and SQL threads are stopped, we purge the relay logs before
starting. And surely they must be stopped when we first start them after a
crash ...

And then in get_master_version_and_clock(), if we deleted the relay logs then
we start from the GTID position:

  if (mi->using_gtid != Master_info::USE_GTID_NO && !mi->master_log_name[0])
  {
    ...

I would suggest adding printouts to the error log around those two places in
the code and checking that both are executed as expected at startup. Maybe one
of the if() conditions becomes false for some reason, or purge_relay_logs()
fails to purge, but I do not see how from the code at the moment.
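
To make the suggestion concrete, here is a rough standalone sketch of the kind
of printout I mean. It is not the server code: MasterInfoModel,
RelayLogInfoModel, USE_GTID_ON and the two helper functions are hypothetical
stand-ins that model only the fields used by the two if() conditions above,
and it writes to stderr instead of the error log. In the server itself the
same values would presumably be logged with something like
sql_print_information() next to the existing conditions.

```cpp
#include <cstdio>

// Hypothetical stand-ins for Master_info and its rli member; only the
// fields that the two if() conditions look at are modelled.
enum UseGtidModel { USE_GTID_NO, USE_GTID_ON };

struct RelayLogInfoModel
{
  bool slave_running;            // SQL thread state
};

struct MasterInfoModel
{
  UseGtidModel using_gtid;
  bool slave_running;            // IO thread state
  RelayLogInfoModel rli;
  char master_log_name[512];
};

// Same condition as in start_slave_threads(), with every input printed so
// it is visible which part makes the condition false.
bool should_purge_relay_logs(const MasterInfoModel &mi)
{
  bool purge= mi.using_gtid != USE_GTID_NO &&
              !mi.slave_running && !mi.rli.slave_running;
  std::fprintf(stderr,
               "start_slave_threads: using_gtid=%d io_running=%d "
               "sql_running=%d -> purge=%d\n",
               (int) mi.using_gtid, (int) mi.slave_running,
               (int) mi.rli.slave_running, (int) purge);
  return purge;
}

// Same condition as in get_master_version_and_clock(): connect from the
// GTID position only if the master log name was cleared by the purge.
bool should_start_from_gtid_pos(const MasterInfoModel &mi)
{
  bool from_gtid= mi.using_gtid != USE_GTID_NO && !mi.master_log_name[0];
  std::fprintf(stderr,
               "get_master_version_and_clock: using_gtid=%d "
               "master_log_name='%s' -> from_gtid=%d\n",
               (int) mi.using_gtid, mi.master_log_name, (int) from_gtid);
  return from_gtid;
}
```

If the first line shows purge=0 after a crash restart, the inputs tell us
which thread was believed to be running. If it shows purge=1 but the second
line shows from_gtid=0, then master_log_name was set again somewhere between
the two places.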

It would also help if I had implemented MDEV-4491
(https://mariadb.atlassian.net/browse/MDEV-4491), to give better printouts in
the error log about where and how the slave is actually connecting to the
master. I will try to get this done soon.

One thing did spring to mind as I was looking at this code: the XtraDB
--innodb_recovery_update_relay_log option. This can overwrite relay-log
information during crash recovery, so it might be able to interfere. But it is
off by default and writes messages to the error log that do not appear in your
logs, so it seems this is not related ...

Thanks for helping look into this. I actually plan to change this code to
something more robust, but it would be good to find the real problem here,
otherwise I may end up just hiding the bug, not fixing it ...

 - Kristian.
