maria-developers team mailing list archive

Slave can take a very long time to start replication

 

Kristian,

As I understand it, when a slave connects to the master and wants to
start replicating, it passes the GTID to start from. The master finds
the binlog file where the earliest GTID is located and then scans
through that file to find the exact binlog position from which to start
sending binlog events. If this binlog file is large, the scan can take a
very long time, and I guess it takes even longer when several slaves
try to start replicating at roughly the same time. We observed 60-90
seconds between the slave's initial connection and the first real
binlog events starting to flow. During this period the slave receives
nothing from the master, so it is very easy for it to mistake the
situation for a connection loss, hit slave_net_timeout, reconnect to
the master, and force it to start searching through the binlog file
from the very beginning again... Putting aside the question of what
value is good enough for slave_net_timeout, I'd say that in any case a
slave taking 60 seconds just to start receiving binlog events from the
master is unacceptable.
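
To make the cost concrete, here is a toy model of the lookup as I
described it above. It is only a sketch of my understanding, not the
actual server code: the names Event, BinlogFile and find_start_position
are made up for illustration, a single replication domain is assumed,
and a GTID is reduced to a bare sequence number.

```python
from typing import List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    seq_no: int   # GTID sequence number (hypothetical: single domain only)
    end_pos: int  # byte offset just past this event within its file


class BinlogFile(NamedTuple):
    name: str
    events: List[Event]  # in file order, seq_no ascending


def find_start_position(
    files: List[BinlogFile], slave_seq_no: int
) -> Optional[Tuple[str, int, int]]:
    """Return (file name, start offset, events scanned), or None if not found."""
    # Step 1: pick the newest file whose first event is not past the
    # slave's GTID. This only looks at the start of each file, so it is cheap.
    for f in reversed(files):
        if f.events and f.events[0].seq_no <= slave_seq_no:
            # Step 2: scan that one file from its beginning, event by event.
            # The cost grows with the size of the file, and a reconnect
            # repeats the whole scan.
            scanned = 0
            for ev in f.events:
                scanned += 1
                if ev.seq_no == slave_seq_no:
                    return f.name, ev.end_pos, scanned
            return None
    return None
```

In this model a slave whose GTID sits near the end of a big file pays
for nearly every event in that file before the first event is sent, and
each reconnect after slave_net_timeout pays that price again.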

Have you thought about this problem before? Maybe you have even
planned a solution for it already?


Thank you,
Pavel

