maria-discuss team mailing list archive

Thread
Date

Re: Very slow insert with load data infile

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Sergey Vojtovich <svoj@xxxxxxxxxxx>
Date: Wed, 12 Feb 2014 22:07:41 +0400
Cc: maria-discuss@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20140212144506.GA15240@meddwl.fritz.box>
User-agent: Mutt/1.5.21 (2010-09-15)

Hi Sergei, Pierre,

On Wed, Feb 12, 2014 at 03:45:06PM +0100, Sergei Golubchik wrote:
> Hi, Pierre!
> 
> Okay, there were quite a few replies already here.
> I'll chime in, but only to confirm that there's no easy solution,
> unfortunately :(
> 
> On Feb 10, Pierre wrote:
> > Hello,
> > 
> > Mariadb is getting slower and slower when I'm inserting a massive
> > amout of data. I'm trying to insert 166 507 066 rows (12go) using load
> > data infile '' into an empty table. I splitted my file in 13 parts of
> > the same size because it was too long to insert in one shot. When I
> > inserted more than 100M rows, it starts to be extremely slow.
> > 
> > Here are the time for each load data infile :
> > 
> > LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec
> > LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec
> > LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec
> > LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec
> > LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec
> > LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec
> > LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec
> > LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec
> > LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec
> > LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !)
> > LOAD DATA INFILE '/tmp/hash/xak' : still running...
> >
> > Can someone explain me what is happening internaly and why it is so slow ?
> 
> Yes. That's because indexes are implemented as B-trees, and B-tree has a
> cost of insertion of O(log N). As you insert more data, insertions
> become slower. Furthermore, B-tree organizes data in pages. MariaDB
> tries to keep all pages in memory, but as soon as index becomes larger
> than keycache size, every key insertion means at least one disk read and
> one write (assuming your key values are distributed normally and
> randomly).
Pierre noted that key_buffer_size is 6Gb.
It should fit about 95M rows (or ~7.5 parts).
According to my calculations resulting index size will be about 11Gb.

Said the above I believe just increasing key_buffer_size may greatly improve
load performance. Note that performance is starting to go down heavily exactly
after 8-th part.

> > Do you think if I just create an index (not a unique index) it will be
> > faster ?
> 
> Not really. Inserting values into a B-tree one-by-one will always be
> slow when your B-tree becomes big enough.
> 
> But you don't need to insert values into a B-tree one by one.  MyISAM
> has a much faster (like, two orders of magnitude) way of building
> b-trees by creating a tree all at once.
> 
> This is used when you insert data into the empty table. Or when you add
> an index to the existing table. Or when you enable indexes. That is:
> 
>  1. LOAD DATA INFILE 'everything, not in chunks'
>     (a convenient way would be to load from a pipe, and cat all your
>     chunks into it).
>  2. ALTER TABLE ... ADD KEY
>  3. ALTER TABLE DISABLE KEYS. insert the data. ALTER TABLE ENABLE KEYS
> 
> But unfortunately, all this only works for non-unique keys, when MariaDB
> doesn't need to expect a unique constraint error and won't need to
> delete rows when creating indexes.  For unique indexes values will be
> inserted into a b-tree one by one.
Indeed, it goes through key cache. Just curious why is that needed for
p.2 - it should never have to delete rows. AFAIR it should just drop
temporary table in case of failure. I guess it was fixed this way in 10.0.

And p.1 can just truncate data file...

> Now, there's another trick you can use. Create a table. Then disable all
> indexes - not with ALTER TABLE ENABLE KEYS (which only disables
> non-unique indexes) - but from a command-line with myisamchk:
> 
>   $ myisamchk --keys-used=0 /path/to/table.MYI
> 
> then insert your data. And REPAIR TABLE or REPAIR TABLE ... QUICK.
> This will rebuild all indexes, the fast way.
> The second variant (with QUICK) will not copy your (presumably, huge)
> data - it will only rebuild indexes. But it cannot remove duplicates
> either.
> 
> And also it's worth noting that MariaDB-10.0 rebuilds both UNIQUE
> and non-UNIQUE indexes the fast way when you do ALTER TABLE ... ADD KEY.

Thanks,
Sergey

Follow ups

Re: Very slow insert with load data infile
From: Sergei Golubchik, 2014-02-12

References

Very slow insert with load data infile
From: Pierre, 2014-02-10
Re: Very slow insert with load data infile
From: Sergei Golubchik, 2014-02-12