maria-developers team mailing list archive

WL#188 New (by Knielsen): Using --log-slave-updates to ensure crash safe/transactional slave state

To: maria-developers@xxxxxxxxxxxxxxxxxxx
From: worklog-noreply@xxxxxxxxxxxx
Date: Mon, 21 Mar 2011 12:56:06 +0000 (UTC)
-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Using --log-slave-updates to ensure crash safe/transactional slave
		state
CREATION DATE..: Mon, 21 Mar 2011, 12:56
SUPERVISOR.....: 
IMPLEMENTOR....: 
COPIES TO......: 
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 188 (http://askmonty.org/worklog/?tid=188)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:



DESCRIPTION:

Overview
--------

A replication slave needs to preserve certain state between slave server
restarts to be able to correctly resume replication from where it left off.

In current MySQL replication, this state is kept in multiple simple files
(master.info, relay-log.info). This is a big problem if the slave server
crashes, as there is no guarantee that these files will be in a consistent
state with what is in the table data (and binlog if using --log-slave-updates),
or even with each other.

The Google patch rpl_transaction_enabled has a partial solution for this, by
duplicating the state inside InnoDB in a transactional way. At slave server
startup, InnoDB can then decide to overwrite the files keeping the slave state
with its own, hopefully more correct, information. There are some remaining
problems with this approach, some of which may be fixable, some not.

However, the basic problem here is the need to maintain state in a
transactional/crash-safe way across multiple subsystems of the server
(e.g. replication and storage engine(s)). And we already have such a mechanism,
in the form of the two-phase commit between engines and binlog.

This worklog describes how we could use this existing mechanism to keep the
replication slave state across server restarts in a crash-safe way, rather
than introduce new complex mechanisms for every new piece of state to be kept.


Idea
----

The main state we need to make crash-safe is the binlog position
(filename,offset) of the next event from the master to execute on the slave.

There is more state stored currently, but there is less need for it to be
transactional:

 - relay-log.info also stores the position in the relay log files on the slave
   which event execution has reached. In the case of normal shutdown this can
   be used as-is. In case of recovery after a crash, it is possible that
   the relay logs are not consistent with the master and/or slave SQL thread,
   so it is probably better to just discard any existing relay log and
   re-fetch all necessary events from the master.

 - master.info mainly stores the connection information from CHANGE MASTER TO,
   which does not change often.

The basic idea is that on slave server start, we get the required information
from the binlog on the slave, rather than from these files (this requires that
--log-slave-updates is enabled). This also has the advantage that we avoid the
need to constantly update the state files after every transaction, saving some
execution cost.

When we shut down the server normally, we close the binlog in a way that
allows us to detect at startup whether we are recovering from a crash or not.
As part of this close, we can write whatever state we need to recover at the
end of the binlog.

Recovering the state then depends on whether we crashed or not.

If we did not crash, then all we need is to be able to find the position of
the event holding the state that was written at the end of the binlog.

One way is to just have a fixed offset from the end where this event starts;
however this is not too robust against finding wrong data there, especially
with respect to binlogs from different versions of the server possibly with
different events (or event sizes) at the end of the log.

Another way is that when we close the binlog, we seek back to the start of the
log and write the position of the last event there (either in the
format_description event, which already has version information, or a new
event written just after format_description event). We already do this seek
anyway to overwrite the flag in format_description event which signals that
the binlog has been closed properly (not crashed).

We just need to be sure to fsync() the binlog _before_ overwriting the
crashed-or-not flag, so we are sure all data will be there in the not-crashed
case.
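
The clean-shutdown ordering above can be illustrated with a minimal sketch.
This is not the server's binlog format: the 13-byte header (magic, in-use
flag, offset of the state record), the record layout, and the names
open_binlog, close_binlog and read_state are all hypothetical, chosen only to
show "write state, fsync, then flip the flag".

```python
import os
import struct

# Hypothetical toy header: magic, in-use flag, offset of the state record.
# The real format_description event is laid out differently.
HEADER = struct.Struct("<4sBQ")
MAGIC = b"TBLG"


def open_binlog(path):
    """Create a toy binlog marked as in use ("crashed" until closed)."""
    f = open(path, "w+b")
    f.write(HEADER.pack(MAGIC, 1, 0))
    f.flush()
    os.fsync(f.fileno())
    return f


def close_binlog(f, master_file, master_pos):
    """Clean shutdown: append the slave state, then publish it in the header."""
    f.seek(0, os.SEEK_END)
    state_offset = f.tell()
    name = master_file.encode()
    f.write(struct.pack("<H", len(name)) + name + struct.pack("<Q", master_pos))
    f.flush()
    os.fsync(f.fileno())  # state must be durable BEFORE the flag is cleared
    f.seek(0)
    f.write(HEADER.pack(MAGIC, 0, state_offset))
    f.flush()
    os.fsync(f.fileno())
    f.close()


def read_state(path):
    """Return (master_file, master_pos) after a clean close, else None."""
    with open(path, "rb") as f:
        magic, in_use, state_offset = HEADER.unpack(f.read(HEADER.size))
        if magic != MAGIC or in_use:
            return None  # not closed cleanly: fall back to a recovery scan
        f.seek(state_offset)
        (length,) = struct.unpack("<H", f.read(2))
        name = f.read(length).decode()
        (pos,) = struct.unpack("<Q", f.read(8))
        return name, pos
```

If the process dies between the two fsync() calls, the flag still says
"in use", so read_state() returns None and the state record at the end is
simply ignored.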

If we did crash, then we cannot rely on any information at the end of the
binlog. In this case, we can instead build on top of the already existing
crash recovery mechanism. This mechanism scans the last binlog to build a list
of all committed transactions, then uses this list to tell storage engines
which previously prepared transactions to commit and which to roll back. As
part of this scan, we can determine the last event executed, and from this we
get the necessary state to continue replication in terms of the binlog
position (filename,offset) on the master.
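
As a rough sketch of the crash case, the scan can return both the set of
committed transactions and the last master position seen. The record layout
here is an assumption made for illustration: each fixed-size record stands
for one committed transaction and carries a transaction ID plus the master
position as (log number, offset) instead of a file name. A partial record at
the end models a write torn by the crash.

```python
import struct

# Hypothetical toy record: xid, master log number, master offset.
# A real binlog has many event types of varying size.
RECORD = struct.Struct("<QIQ")


def scan_after_crash(data):
    """Return (committed_xids, last_master_position) from raw log bytes."""
    committed = set()
    last_pos = None
    offset = 0
    # Stop at the first record that does not fit: it was never fully written.
    while offset + RECORD.size <= len(data):
        xid, log_no, log_off = RECORD.unpack_from(data, offset)
        committed.add(xid)
        last_pos = (log_no, log_off)
        offset += RECORD.size
    return committed, last_pos


def resolve_prepared(prepared_xids, committed):
    """Commit prepared transactions found in the log, roll back the rest."""
    return {
        xid: ("commit" if xid in committed else "rollback")
        for xid in prepared_xids
    }
```

The same pass that feeds resolve_prepared() also yields last_pos, which is
the position from which replication would resume.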

Note that if something like group ID (MWL#175) or global transaction ID is
implemented, then the state to preserve is the last executed ID rather than
binlog position; however, the basic mechanism remains the same.


Discussion
----------

The main disadvantage I see with this approach is that in order for this to be
really crash safe, we need to run with innodb_flush_log_at_trx_commit=1 and
sync_binlog=1. This requires 3 fsync() calls per commit, and group commit is
not possible since the slave is single-threaded. This is likely to be too
expensive for many installations to be able to use it.

However, it may be possible to reduce this overhead sufficiently by using
MWL#185 (grouping multiple commits together on the slave to reduce fsync()
overhead), or by implementing parallel replication to utilise group commit
(MWL#169, MWL#184, MWL#186).
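
The effect of grouping commits can be estimated with simple arithmetic. The
sketch below assumes a deliberately crude model that is not taken from any
measurement: the fsync() calls run one after another, each takes the same
time, and they dominate commit cost. The function names and the batch_size
parameter are hypothetical.

```python
def fsyncs_per_transaction(batch_size, fsyncs_per_commit=3):
    """Amortized fsync() calls per transaction when commits are batched."""
    return fsyncs_per_commit / batch_size


def max_transactions_per_second(fsync_seconds, batch_size=1,
                                fsyncs_per_commit=3):
    """Upper bound on throughput if serial fsync() calls are the only cost."""
    return batch_size / (fsyncs_per_commit * fsync_seconds)
```

Under this model, throughput scales linearly with the number of transactions
that share one set of fsync() calls, which is why batching or group commit
matters so much for a single-threaded slave.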


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)