maria-discuss team mailing list archive

Re: Is disabling doublewrite safe on ZFS?

 

On 21/08/2018 11:52, Marko Mäkelä wrote:
I believe that the Linux kernel can interrupt any write at 4096-byte
boundaries when a signal is delivered to the process.
I am curious: Where was it claimed that data=journal guarantees atomic
writes (other than [1])?
I would expect it to only guarantee that anything that was written to
the journal will be durable.
Whether the actual write request was honored in full is a separate matter.

Sure, ext4 with data=journal only has "atomic writes" in the sense that whatever was written in a journal transaction/commit will be completely committed to the main filesystem.

But from the application's point of view, this could very well be a partial write. This is exactly the point I am stressing: durable writes do *not* mean atomicity in the true sense (i.e., from the application's standpoint).
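
To illustrate what a partial write looks like from the application side, here is a minimal sketch of the usual retry loop around write(). The helper name write_fully is hypothetical. Note that it only covers short writes and EINTR while the process is alive; it cannot help if the process is killed (e.g. with SIGKILL) in the middle of the loop, which is exactly the case discussed here.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes have been
 * accepted or a real error occurs. Returns 0 on success, -1 on error. */
static int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;          /* real error, errno is set by write() */
        }
        p += n;                 /* short write: skip what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller that ignores the return value of a single write() has no way of knowing that only part of the buffer reached the file.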

In this regard, I would imagine ZFS to behave similarly: at TXG commit, anything buffered in RAM (and replicated by the ZIL) would be committed to the main filesystem, but if the application write itself was incomplete (due to an application crash) *and* the application-side doublewrite buffer was disabled, bad things could happen...
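
For context, the idea behind a doublewrite buffer can be sketched roughly as follows. This is only an illustration of the general technique, not InnoDB's actual implementation: DB_PAGE_SIZE, pwrite_page and doublewrite_page are hypothetical names, and the 16 KiB page size is an assumption chosen to match recordsize=16k.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define DB_PAGE_SIZE 16384      /* assumption: 16 KiB database pages */

/* Write one whole page at offset `off`, retrying on short writes. */
static int pwrite_page(int fd, const void *page, off_t off)
{
    const char *p = page;
    size_t left = DB_PAGE_SIZE;

    while (left > 0) {
        ssize_t n = pwrite(fd, p, left, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        off += n;
        left -= (size_t)n;
    }
    return 0;
}

/* Hypothetical doublewrite: make a scratch copy durable first, and only
 * then overwrite the page at its home location in the data file. */
static int doublewrite_page(int scratch_fd, int data_fd,
                            const void *page, off_t home_off)
{
    if (pwrite_page(scratch_fd, page, 0) != 0 || fsync(scratch_fd) != 0)
        return -1;
    if (pwrite_page(data_fd, page, home_off) != 0 || fsync(data_fd) != 0)
        return -1;
    return 0;
}
```

If the process dies during the second step, an intact copy of the page still exists in the scratch file, so recovery can restore the home location from it. Disabling doublewrite removes that first step and relies on the lower layers never leaving a half-written page behind.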

Please report back any findings, whether or not you consider them to
be interesting.

I believe that it is technically possible for a copy-on-write
filesystem like ZFS to support atomic writes, but for that to be
possible in practice, the interfaces inside the kernel must be
implemented in an appropriate way.
Disclaimer: I have no knowledge of the implementation details of any kernel.

I would expect (and I can be wrong!) that "atomic writes" in the MySQL/MariaDB context means more than durable writes; rather, I expect them to be a means of communicating the application's consistency model to the lower layer (i.e., the storage device). Something similar to "buffer all writes and atomically write them into the main filesystem only when I (MariaDB) *explicitly* tell you to do that". In this case, a crashed MariaDB will *never* commit partial data to the main database files.

I wrote a test program[1] which spawns a child that repeatedly writes a buffer to a backing file, while the parent process kills it (-9) at a random time. It seems *very* difficult to cause any sort of partial write, both on ext4 (even with no data journal!) and on ZFS. You basically have to interrupt the write() call at a very precise moment, and good luck doing that, especially when writing small data chunks.

So it really seems that a doublewrite-less MariaDB would be safe from corruption unless extraordinarily bad luck (i.e., mysqld crashing within a *really small* window) hits.
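
Whether such an unlucky crash actually tore a page can only be noticed if each page carries a checksum. Below is a minimal sketch of the idea, assuming a 16 KiB page whose last 4 bytes hold a 32-bit FNV-1a hash of the rest. InnoDB has its own page format and checksum algorithm; the names here (DB_PAGE_SIZE, page_seal, page_is_intact) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DB_PAGE_SIZE 16384                          /* assumption: 16 KiB pages */
#define PAYLOAD_SIZE (DB_PAGE_SIZE - sizeof(uint32_t))

/* 32-bit FNV-1a hash, used here only as a stand-in page checksum. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Store the checksum of the payload in the last 4 bytes of the page. */
static void page_seal(unsigned char *page)
{
    uint32_t sum = fnv1a(page, PAYLOAD_SIZE);
    memcpy(page + PAYLOAD_SIZE, &sum, sizeof sum);
}

/* Return 1 if the stored checksum matches the payload, 0 otherwise. */
static int page_is_intact(const unsigned char *page)
{
    uint32_t stored;
    memcpy(&stored, page + PAYLOAD_SIZE, sizeof stored);
    return stored == fnv1a(page, PAYLOAD_SIZE);
}
```

A page that was only partially overwritten will, with overwhelming probability, fail page_is_intact() on the next read. Without a second copy of the page, the damage can be detected this way but not repaired.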

I plan to do some more tests with a "real" MariaDB installation being crashed in the middle of intense writes. I'll update you when done.

Test setup:
- CentOS 7 x86-64 VM on KVM host
- 1 GB RAM
- 8 GB disk
- ext4 (data=ordered) and ZFS (compression=off, xattr=sa, recordsize=16k) filesystems, each created on top of a ~400 MB file under /dev/shm (basically a RAM disk), mounted on /mnt/
- varying buffer size (16k, 128k and 4m)

Results...

# ext4 16k
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 16      /mnt/append.txt
   1000 ec6affcd48d0f33be5cb211f99453b73  /mnt/append.txt

# ext4 128k
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 128     /mnt/append.txt
   1000 8f607cfdd2c87d6a7eedb657dafbd836  /mnt/append.txt

# ext4 4m <-- PARTIAL WRITES DETECTED
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
      1 1624    /mnt/append.txt
      1 2892    /mnt/append.txt
    998 4096    /mnt/append.txt
      1 5ab53863a602f93aaef0c7578bb2f91d  /mnt/append.txt
      1 c67e09d43084ce17cef2f844482bf9a9  /mnt/append.txt
    998 d5e9dca290ea8d856183557a31d5eb72  /mnt/append.txt

Ext4 summary: partial writes detected only when the buffer size was 4m.

# zfs 16k (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 16      /mnt/append.txt
   1000 ec6affcd48d0f33be5cb211f99453b73  /mnt/append.txt

# zfs 128k (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
      4 0       /mnt/append.txt
    996 128     /mnt/append.txt
   1000 8f607cfdd2c87d6a7eedb657dafbd836  /mnt/append.txt

# zfs 4m (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
    353 0       /mnt/append.txt
    647 4096    /mnt/append.txt
   1000 d5e9dca290ea8d856183557a31d5eb72  /mnt/append.txt

ZFS summary: no partial writes detected, although the apparent file size was sometimes wrong (it may be a lazy metadata update; the md5sum was always correct).
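
The size/md5sum mismatch can be cross-checked directly: du --apparent-size reports st_size from stat(), while md5sum reads the content until EOF. A minimal sketch of such a check follows (the helper name size_vs_content is hypothetical, and it was not part of the tests above):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report both the size claimed by the metadata
 * (st_size) and the number of bytes that can actually be read.
 * Returns 0 on success, -1 on any error. */
static int size_vs_content(const char *path, off_t *reported, off_t *readable)
{
    struct stat st;
    char buf[65536];
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    *reported = st.st_size;
    *readable = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        *readable += n;         /* count bytes until EOF */
    close(fd);
    return n < 0 ? -1 : 0;
}
```

If the two values differ right after the kill but agree shortly afterwards, that would point to a delayed metadata update rather than to lost data.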

I hope the above data is interesting. If I did something wrong, please let me know.

---

[1] Test program
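#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>                   // gettimeofday()
#include <signal.h>

#define MAX_COUNT 1
#define MAX_WAIT 1000                   // upper bound (microseconds) on the delay before the kill
#define BUF_SIZE (16*1024)              // or (128*1024) or (4*1024*1024)

void  ChildProcess(void);
void  ParentProcess(pid_t);

int  main(void)
{
        pid_t  pid;
        int i;

        for (i = 0; i < MAX_COUNT; i++) {
                pid = fork();
                if (pid < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid == 0)
                        ChildProcess();         // never returns: loops until killed
                else
                        ParentProcess(pid);
        }
        return 0;
}

// Child: rewrite the same BUF_SIZE buffer at offset 0 until killed
void  ChildProcess(void)
{
        int fd;
        int res;
        char *str;
        str = (char *) malloc(BUF_SIZE);
        if (str == NULL)
                _exit(1);
        memset(str, 48, BUF_SIZE);              // 48 == '0'
        // O_CREAT requires an explicit mode argument
        fd = open("/mnt/append.txt", O_SYNC | O_WRONLY | O_TRUNC | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                _exit(1);
        }
        while (1)  {
                lseek(fd, 0, SEEK_SET);
                // result deliberately unchecked: the test inspects the file afterwards
                res = write(fd, str, BUF_SIZE);
                (void) res;
        }
}

// Parent: sleep for a random time (< MAX_WAIT microseconds), then SIGKILL the child
void  ParentProcess(pid_t pid)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        srandom(tv.tv_usec);                    // seed random(), not rand()
        usleep(random() % MAX_WAIT);
        kill(pid, SIGKILL);
}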


--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8

