
maria-discuss team mailing list archive

Re: fsync alternative

 

On Fri, 13 Sep 2019 09:34:48 +0300
Marko Mäkelä <marko.makela@xxxxxxxxxxx> wrote:

> But, InnoDB’s use of fsync() on data files feels like an overkill. I
> believe that we only need some 'write barriers', that is, some
> interlocking between log and page writes to prevent some writes from
> being reordered. I think that we need the following: (1) Ensure strict
> write-ahead logging: Guarantee that writes to log are completed before
> the corresponding data pages are written. fsync() on the log file does
> this, but it is overkill for this. (2) For most page writes
> ('background flushing'), we do not care when exactly they completed,
> and we do not need or want fsync(). (3) On significant transaction
> state change (COMMIT, or a transition of XA PREPARE to XA ROLLBACK),
> we must ensure that the log record for that (and any preceding log)
> will be durably written to the log file(s). This probably calls for
> fsync(). (4) On log checkpoint (logically discarding the start of the
> write-ahead log), we must ensure that all the pages that were referred
> to by the to-be-discarded section of log will have been written to disk.
> This could be the only case where we might actually need fsync() on
> data files. Preferably it should be done asynchronously. Log
> checkpoint must also fsync() the log file(s) to make the checkpoint
> metadata durable.
> 
> Also, the InnoDB doublewrite buffer is a work-around for the operating
> system kernel not supporting atomic writes of data pages (even though
> to my understanding, it should be technically doable on any
> journal-based or copy-on-write file system).
> 
> This week, there was a Linux Plumbers Conference, with a Databases
> Microconference
> https://linuxplumbersconf.org/event/4/page/34-accepted-microconferences#db
> where both the atomic writes and fsync() were discussed. I hope that
> Sergei Golubchik or Daniel Black can report more on that. In any case,
> it might take years before these improvements become available in
> Linux distributions; with luck, something will happen soon and be
> backported to the kernels in stable or long-term-support
> distributions.
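
As a rough illustration of requirements (1) to (4) quoted above, here is a minimal Python sketch. The class TinyWal and all of its method names are hypothetical and are not InnoDB code. It uses fsync() at every point where a write barrier would be enough, because that is the only portable tool available today, and it assumes small writes that the kernel accepts in one call.

```python
import os
import struct

PAGE_SIZE = 4096  # illustrative page size, not InnoDB's default


class TinyWal:
    """Toy write-ahead log plus data file (hypothetical, for illustration)."""

    def __init__(self, log_path, data_path):
        self.log_fd = os.open(log_path, os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o600)
        self.data_fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o600)
        self.written_lsn = 0  # log bytes handed to the kernel
        self.durable_lsn = 0  # log bytes known to be fsync()ed

    def log_append(self, payload):
        # Length-prefixed record; returns the LSN (byte offset) after it.
        record = struct.pack("<I", len(payload)) + payload
        os.write(self.log_fd, record)
        self.written_lsn += len(record)
        return self.written_lsn

    def log_flush(self):
        os.fsync(self.log_fd)
        self.durable_lsn = self.written_lsn

    def write_page(self, page_no, page, page_lsn):
        # (1) Strict write-ahead logging: the log up to page_lsn must be
        #     durable before the page write is issued.
        if page_lsn > self.durable_lsn:
            self.log_flush()
        os.pwrite(self.data_fd, page, page_no * PAGE_SIZE)
        # (2) Background flushing: no fsync() on the data file here.

    def commit(self):
        # (3) A transaction state change must be durable in the log.
        lsn = self.log_append(b"COMMIT")
        self.log_flush()
        return lsn

    def checkpoint(self):
        # (4) Pages covered by the log being discarded must be on disk.
        #     This assumes every such page already went through write_page().
        os.fsync(self.data_fd)
        # The checkpoint metadata itself must also be durable.
        self.log_append(b"CHECKPOINT")
        self.log_flush()
        return self.durable_lsn

    def close(self):
        os.close(self.log_fd)
        os.close(self.data_fd)
```

The only data-file fsync() in this sketch is in checkpoint(), which matches point (4). The fsync() in write_page() is the call that a real barrier primitive could replace.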

On atomic writes, the direction we were pointed to is the XFS CoW (reflink=1) implementation. I haven't found the kernel interfaces required to use it. SQLite uses an f2fs ioctl extension.

An alternative that was discussed is using attached NVDIMM space as helper space for filesystems; however, every filesystem would have to implement support for it, and of course users would need to have NVDIMM in the first place. At one stage Seagate had hard disks with NVDIMM on them; I'm unsure whether this is still the case.

If devices ever supported atomic writes, using them would require implementing changes all the way down from the filesystem, through pseudo block devices like device mapper (aka LVM) and crypt layers, to the block layer.

On write barriers, as I followed the discussion this seemed to be heading towards an enhancement to io_uring, but I'm waiting for the video so I can check whether I missed anything. I can't see much chance of a backport here (io_uring was first added in the 5.1 kernel); however, it might make Ubuntu 20.04 LTS if it gets defined, implemented and tested quickly enough.


On Fri, 13 Sep 2019 21:08:13 +0200
Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> wrote:

> > But, InnoDB’s use of fsync() on data files feels like an overkill. I
> > believe that we only need some 'write barriers', that is, some  
> 
> This is also quite interesting. My (admittedly limited) understanding is
> that disks in fact have write-barrier functionality,

According to a major disk vendor in the LPC Databases Microconference session, SCSI had ordering as an option; however, it was never implemented by any vendor.

Without this existing in hardware, I think the discussion went along the lines that a write barrier would need to wait until the hardware queue is fully flushed. (Lots of hardware specification acronyms were mentioned in quick succession.)


