maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #03681
WL#164 New (by Knielsen): Extend crash recovery to recover non-prepared transactions from binlog
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Extend crash recovery to recover non-prepared transactions from binlog
CREATION DATE..: Wed, 27 Oct 2010, 13:08
SUPERVISOR.....:
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 164 (http://askmonty.org/worklog/?tid=164)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Currently, when an xa-capable storage engine commits a transaction that is
also written to the binary log, we do three fsync() calls, which is
expensive. One in prepare(), one for binlog write(), and one for commit().
These multiple fsync()'s are needed to ensure that crash recovery can recover
into a consistent state after a crash that happens during commit, so that any
transaction that becomes committed in the storage engine is also committed in
the binlog, and vice versa. This is essential for replication, as otherwise a
crashed master may not be able to recover into a usable state without a full
restore from backup.
The fsync()s are also needed to ensure durability, that is, any transaction
that is reported to the client as having committed, will also remain committed
after a crash.
This worklog is about optimising this so that we can reliably recover into a
consistent state (and preserve durability) with just a single fsync(), the one
for the binary log. In effect, this will allow to run with
--innodb-flush-log-at-trx-commit=2, or even =0, and still preserve both
durability of InnoDB commits and consistent crash recovery.
The idea is that along with every commit, InnoDB will store also an
identification of the transactions associated with that commit (in a
transaction-safe way, eg. using the InnoDB transaction log). In fact, InnoDB
already does this, since it records the binlog position of the last commit
done at any point in time.
Then during crash recovery, we can do as follows:
- First let each engine recover into an engine-consistent state. This may
be missing some transactions due to --innodb-flush-log-at-trx-commit=0.
- Ask each transactional engine for the ID of the last transaction committed.
- By virtue of MWL#116, we know that the order of commits in the storage
engine and in the binlog is the same. Thus, the transactions missing in the
engine is exactly those that are written in the binlog after the last
transaction committed in the engine.
- Recover the missing transactions in the engine by re-playing the binary log
from that point to the end (similar to how it would be applied on a slave.)
There are some restrictions/requirements:
- This will only work for engines that implement the MWL#116 commit_ordered()
handlerton method.
- After a crash, different engines (eg. PBXT vs. InnoDB) may be in different
states (different last transaction committed). It will be necessary to
recover each engine separately, at least up to a point where they are
consistent. This should be simple enough for transactions that only involve
tables for one engine. For cross-engine transactions, it can probably work
for row-based replication, simply by ignoring updates to tables in other
engines when applying the binlog events. But for statement-based
replication it will not work easily [*].
This way, it should be possible to greatly reduce the fsync() overhead when
using the binlog with a transactional engine.
[*] One could imagine trying to make it work even for statement-based binlog
events, at least for MVCC engines. When running the updates against one
engine, any select against the other engine could try to use an MVCC snapshow
corresponding to the transaction that corresponds to the appropriate
transaction back in time. But I think this is not realistic / worth it to
implement.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)