
maria-developers team mailing list archive

Re: accelerating CREATE TABLE


Toshikuni Fukaya <toshikuni-fukaya@xxxxxxxxxxxx> writes:

> I made a patch to accelerate CREATE TABLE on the innodb plugin.
> To zero table spaces, I used fallocate instead of normal writes and sync.

Thanks for your work. Today I took a closer look at the patch and the deeper issues involved.

I think we first need to better understand:

1. What are the inefficiencies of the current code?

2. What is the effect of the patch, and why does it speed things up?

I mean, understand in terms of exactly what I/O operations are performed on
the disk in the different cases. Mostly for the ext4 and XFS file systems,
maybe there are others of interest.

The current code does a series of 1MB writes at the end of the file to extend
the file. I think these writes are done with O_DIRECT if O_DIRECT is
enabled. I do not see any fsync() or fdatasync() call at the end.

Did you run your benchmarks with O_DIRECT or without? By how much is the
tablespace extended on each call to fil_extend_space_to_desired_size()?

One possible inefficiency is if each 1MB O_DIRECT write flushes to disk both
the data written and also the new size of the file. I did not find a conclusive
answer to this one way or the other; maybe it depends on the file system? A
1MB sequential write to a non-SSD harddisk costs around the same as one random
I/O, so this alone could double the time needed.

Another potential inefficiency is that the existing code first writes zero to
each data page. But then what happens when the page is first needed? I assume
it is not read, rather a new page is initialised and written. So if
fallocate() on the given system can just mark the block allocated, then we can
save the initial write of zeros, just writing the initial page later.

On the other hand, that initial write will then need to update metadata saying
that the disk blocks are now in use. So you need to also benchmark the cost of
both creating the table and then afterwards filling it up with data, in a
situation where the I/O is the bottleneck, not CPU. This needs to be done a
bit carefully to ensure that the tablespace pages are actually written
(initially only the redo log is written).

So possible reasons for the speedup from the patch include:

 - Less syncing of metadata to disk during extending the file.

 - Saving initial write of disk page.

I want to understand which of these are in play, if any, and if there are
other effects.

Apart from researching documentation and so on, one way to understand this
better is to run benchmarks inside a virtual machine like kvm, and run strace
on the kvm process in the host. This shows all I/O operations to the disk.

On the other hand, using fallocate to extend the file in one go gives the file
system more information than writing it in pieces. So this could potentially
be a better method. But it touches a core part of I/O performance, so we need
to understand it.

I also have some smaller comments on the patch itself, given inline below. But
we should probably resolve the above general questions first:

> Original MariaDB 5.5 trunk: 64.9 seconds (5 times avg.)
> Patched MariaDB 5.5 trunk: 36.0 seconds (5 times avg.)

The patch would need to be made against MariaDB 10.0. But I looked, and the
code is much the same in 10.0 as in 5.5, so that should not be too hard.

> +	ibool		fallenback	= TRUE;

> +			fallenback = FALSE;
> +			goto extend_after;

> +extend_after:
> +	fil_node_complete_io(node, fil_system, fallenback ? OS_FILE_WRITE : OS_FILE_READ);

I think the two different methods of extending the file should be done in
separate functions called from here. Then the complexity with the `fallenback'
flag and the goto is not needed.

Also, I think there should be an fdatasync() after fallocate(). Or what do you
think, is it not needed, and why not? What is the typical size by which a
tablespace is extended? The os_file_set_size() does have an fdatasync().

 - Kristian.
