maria-discuss team mailing list archive

Re: Is disabling doublewrite safe on ZFS?


On 14-08-2018 19:58 Vladislav Vaintroub wrote:
There is at least one case I know of where you do not need the doublewrite
buffer. And you do not even need a CoW filesystem.

A combination of an OS guarantee that sector-sized writes are atomic,
and a matching InnoDB page size. If you have
disks with 4K sectors (quite common), and you choose
innodb-page-size=4K, and use innodb-flush-neighbors=0, and use
Windows as your OS (because it guarantees that
single-sector sized/aligned writes are atomic as per
[1]), then you can safely disable innodb-doublewrite. You do not need
"supported hardware" for that.

Let's suppose mysqld crashes during the copy from its internal buffer to the OS write cache, so that only partial data are transferred (i.e. 2K of data on a 4Kn disk). With direct writes (or FILE_FLAG_WRITE_THROUGH), the partial data will be rejected by the underlying disk, which throws an I/O error. But what about non-O_DIRECT/FILE_FLAG_WRITE_THROUGH writes?

As for Linux, I think Marko tested what happens when a process is
killed, and sure enough, it can be killed in the middle of a
larger write and leave partially written data. I suspect that O_DIRECT
and sector-sized writes might be atomic (as in the Windows example), but
I did not find any written confirmation of that. Someone with a better
understanding of the kernel and filesystems could prove or disprove this.

Yes, an O_DIRECT, single-sector, aligned write *should* be atomic, supposing the disk rejects the partial write. However, this really is a hardware-specific condition. Back to ZFS: the entire record *will* be written atomically. As a first approximation, when recordsize == InnoDB page size, doublewrite should not be needed. However, as stated above, what happens if the mysqld process is killed at the wrong moment?
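To make the size and alignment condition concrete, here is a minimal Python sketch. The helper name is hypothetical, and passing the check only means the write *could* map to a single sector or record; it says nothing about whether the disk or filesystem really honours atomicity.

```python
def is_single_unit_write(offset: int, length: int, unit: int) -> bool:
    """Return True when a write covers exactly one unit-aligned block.

    `unit` stands for the disk sector size or the ZFS recordsize (an
    assumption of this sketch); `offset` and `length` describe the write.
    """
    if unit <= 0:
        raise ValueError("unit must be positive")
    return length == unit and offset % unit == 0


# 4K InnoDB page on a 4Kn disk, page-aligned offset
print(is_single_unit_write(0, 4096, 4096))          # True
# 16K InnoDB page on a 16K ZFS recordsize
print(is_single_unit_write(32768, 16384, 16384))    # True
# 16K InnoDB page on a 4K sector: spans four sectors
print(is_single_unit_write(0, 16384, 4096))         # False
```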

I fear something like:
- InnoDB pagesize and ZFS recordsize are both at 16K;
- InnoDB calls write() to copy 16K of internal data to the OS pagecache (ZFS does not support O_DIRECT, by the way);
- mysqld crashes at the worst possible moment, so only 1/2 of InnoDB internal data (8K) was written by write();
- ZFS received the partial 8K data, but it does *not* know these are partial data only (i.e. it "sees" a normal 8K write);
- some seconds later, the partial data are committed to stable storage;
- when mysqld restarts, InnoDB complains about a partial page write.
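The sequence above can be modelled with a toy Python sketch. It is not InnoDB's real page format: the page layout, the CRC32 trailer and all function names are assumptions made only to show why an 8K prefix of a new page on top of the old page no longer validates.

```python
import zlib

PAGE_SIZE = 16 * 1024  # 16K, as in the scenario above


def make_page(payload: bytes) -> bytes:
    """Build a toy page: zero-padded payload followed by a 4-byte CRC32 trailer."""
    if len(payload) > PAGE_SIZE - 4:
        raise ValueError("payload too large for one page")
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")


def page_is_intact(page: bytes) -> bool:
    """Return True when the trailer matches the CRC32 of the page body."""
    if len(page) != PAGE_SIZE:
        return False
    body, trailer = page[:-4], page[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == trailer


def torn_write(old_page: bytes, new_page: bytes, bytes_written: int) -> bytes:
    """Model an in-place overwrite that stopped after `bytes_written` bytes."""
    return new_page[:bytes_written] + old_page[bytes_written:]


old = make_page(b"row-v1")
new = make_page(b"row-v2")

# Only the first 8K of the new page reached storage before the crash.
torn = torn_write(old, new, PAGE_SIZE // 2)

print(page_is_intact(old))   # True
print(page_is_intact(new))   # True
print(page_is_intact(torn))  # False: new data, but the old page's trailer
```

In this model the filesystem can store the torn page perfectly "atomically" and the result is still invalid, because the damage happened before the data reached it.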

This brings up another question: how will InnoDB behave after detecting a partial page write? Will it shut itself down?

Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8