maria-discuss team mailing list archive

Re: fsync alternative

Hi all,

First, sorry for a long quote before my response inline:

On Sat, Sep 7, 2019 at 11:22 PM Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
>
> Jure Sah <dustwolfy@xxxxxxxxx> writes:
>
> > It would appear that on a typical webserver, the majority of disk i/o
> > belongs to the MariaDB database process. It would appear it's mostly
> > waiting on fsync(), which to my understanding is executed at least once
> > per commit.
>
> Group commit can amortise this if there are multiple commits in parallel,
> but essentially yes.
>
> > I've also noticed in the documentation that the options to control fsync
> > usage are even more limited than in the MySQL server. They are also very
> > strongly argued against. Considering the point that InnoDB is considered
> > to be in an inconsistent state in any event, so long as the server is
> > not cleanly stopped, is there really justification for such strong
> > opposition here?
>
> Usually you can just set --sync-binlog=0 --innodb-flush-log-at-trx-commit=2
> and most fsync should be gone.
>
> > I understand that this is extensively researched in the documentation
> > and it has to do with the recovery of data in case of an unexpected
> > server reboot.
>
> InnoDB should recover safely in any case. But if binlog is enabled, the
> binlog is likely to become out of sync with the data inside innodb (in case
> of kernel crash/power outage. Mariadb process crash by itself is not enough
> to break recovery). So if the server is a replication master, slaves might
> need to be rebuilt.

Exactly. No matter which value of innodb_flush_log_at_trx_commit you
use, InnoDB should recover from a crash. In the event the database
server is killed, you can lose a few of the latest committed
transactions in the worst case.

The fsync() operations at transaction commit make the transaction
state change (and any preceding changes) durable, by ensuring that all
the operations up to the user transaction's commit have been written
to the write-ahead log (the InnoDB redo log).
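
To illustrate the principle (this is only a sketch with made-up names,
not the actual InnoDB code): a commit may only be acknowledged once the
log write has been followed by a successful fsync().

/* Minimal sketch: the commit record is durable only after fsync()
 * of the write-ahead log returns. Names and layout are invented. */
#include <unistd.h>

int commit_transaction(int log_fd, const void *log_record, size_t len,
                       off_t log_end)
{
    /* 1. Append the commit record (and any still-buffered log).
     *    pwrite() alone only reaches the kernel page cache. */
    if (pwrite(log_fd, log_record, len, log_end) != (ssize_t) len)
        return -1;

    /* 2. Force the log to stable storage. Only after this returns
     *    may we tell the client that the transaction committed. */
    if (fsync(log_fd) != 0)
        return -1;

    return 0; /* durable across kernel crash or power loss */
}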

When the binlog is enabled, for recovery to work properly, the
binlog's record of transaction commits should never be behind the
storage engines. The current default setting of sync_binlog=0 is
actually unsafe because of this.
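
For reference, the two extremes discussed in this thread would look
roughly like this in my.cnf (the values only illustrate the trade-off;
they are not a recommendation):

[mysqld]
# Durability-first: binlog stays consistent with InnoDB after a
# kernel crash or power failure.
sync_binlog                    = 1
innodb_flush_log_at_trx_commit = 1

# Relaxed: far fewer fsync() calls, but a crash may lose the latest
# committed transactions and leave the binlog behind the engine.
# sync_binlog                    = 0
# innodb_flush_log_at_trx_commit = 2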

There is some ongoing development work around this area. If the binlog
is enabled, it should actually be unnecessary to persist the storage
engine log, because it should be possible to replay any
not-committed-in-engine transactions from the binlog. We must merely
guarantee that the binlog is never behind the storage engine log, that
is, any write to the InnoDB redo log must be preceded by an fsync() of
the binlog. I think that this practically requires sync_binlog=1. Work
on this is being tracked in https://jira.mariadb.org/browse/MDEV-18959
(Engine transaction recovery through persistent binlog). For
transactions that involve at most one transactional storage engine, we
should even be able to get rid of the multi-phase commit mechanism.
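
A rough sketch of that ordering rule (invented names, not the actual
server code): the binlog is made durable before any dependent redo log
write, so recovery can always replay from the binlog.

#include <unistd.h>

int write_redo_after_binlog(int binlog_fd, int redo_fd,
                            const void *redo_rec, size_t len,
                            off_t redo_end)
{
    /* 1. Everything written to the binlog so far must be durable
     *    before the redo log may refer to it. */
    if (fsync(binlog_fd) != 0)
        return -1;

    /* 2. Only now may the dependent redo log write proceed; the redo
     *    log itself would no longer need its own fsync() at commit. */
    if (pwrite(redo_fd, redo_rec, len, redo_end) != (ssize_t) len)
        return -1;

    return 0;
}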

But, InnoDB’s use of fsync() on data files feels like overkill. I
believe that we only need some 'write barriers', that is, some
interlocking between log and page writes to prevent certain writes
from being reordered. I think that we need the following:

(1) Ensure strict write-ahead logging: guarantee that writes to the
log are completed before the corresponding data pages are written.
fsync() on the log file achieves this, but it is more than we need.

(2) For most page writes ('background flushing'), we do not care when
exactly they complete, and we do not need or want fsync().

(3) On a significant transaction state change (COMMIT, or a transition
of XA PREPARE to XA ROLLBACK), we must ensure that the log record for
that change (and any preceding log) will be durably written to the log
file(s). This probably calls for fsync().

(4) On a log checkpoint (logically discarding the start of the
write-ahead log), we must ensure that all the pages referred to by the
to-be-discarded section of the log have been written to disk. This
could be the only case where we might actually need fsync() on data
files, and preferably it should be done asynchronously. A log
checkpoint must also fsync() the log file(s) to make the checkpoint
metadata durable.
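
To make those four cases concrete, here is a rough sketch with
invented names (not actual InnoDB code): only the log is synced at
commit, ordinary page writes go without fsync(), and only the
checkpoint syncs data files.

#include <unistd.h>

/* (1)+(3): commit makes the log durable; data page writes for the
 * affected pages must not be issued before this returns. */
int commit(int log_fd, const void *rec, size_t len, off_t end)
{
    if (pwrite(log_fd, rec, len, end) != (ssize_t) len) return -1;
    return fsync(log_fd);      /* durability barrier on the log only */
}

/* (2): background flushing of a data page; no fsync(), we only rely
 * on the write-ahead rule that the log was written first. */
int flush_page(int data_fd, const void *page, size_t size, off_t offset)
{
    return pwrite(data_fd, page, size, offset) == (ssize_t) size ? 0 : -1;
}

/* (4): checkpoint; all pages covered by the to-be-discarded log must
 * be on disk, so this is the one place data files might need fsync(). */
int checkpoint(int data_fd, int log_fd,
               const void *ckpt_rec, size_t len, off_t end)
{
    if (fsync(data_fd) != 0) return -1;              /* pages durable */
    if (pwrite(log_fd, ckpt_rec, len, end) != (ssize_t) len) return -1;
    return fsync(log_fd);               /* checkpoint metadata durable */
}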

Also, the InnoDB doublewrite buffer is a work-around for the operating
system kernel not supporting atomic writes of data pages (even though
to my understanding, it should be technically doable on any
journal-based or copy-on-write file system).
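
Conceptually (again with invented names, not the real implementation),
the doublewrite mechanism looks like this: the page image is first
written and synced to a scratch area, so that a torn in-place write
can be repaired from that copy during recovery.

#include <unistd.h>

int doublewrite_page(int dblwr_fd, off_t dblwr_off,
                     int data_fd, off_t page_off,
                     const void *page, size_t page_size)
{
    /* 1. Write the full page image to the doublewrite area and sync. */
    if (pwrite(dblwr_fd, page, page_size, dblwr_off) != (ssize_t) page_size)
        return -1;
    if (fsync(dblwr_fd) != 0)
        return -1;

    /* 2. Now write the page in place. If this write is torn by a
     *    crash, recovery restores the intact copy from step 1. With
     *    kernel-supported atomic page writes, step 1 would go away. */
    if (pwrite(data_fd, page, page_size, page_off) != (ssize_t) page_size)
        return -1;

    return 0;
}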

This week, there was a Linux Plumbers Conference, with a Databases
Microconference
https://linuxplumbersconf.org/event/4/page/34-accepted-microconferences#db
where both atomic writes and fsync() were discussed. I hope that
Sergei Golubchik or Daniel Black can report more on that. In any case,
it might take years before these improvements become available in
Linux distributions; with luck, something will happen soon and be
backported to the kernels in stable or long-term-support
distributions.

With best regards,

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation

