
maria-discuss team mailing list archive

Re: Very slow insert with load data infile


your keys *must* fit in RAM, period
your key buffers must be large enough

On 12.02.2014 11:10, Pierre wrote:
> You don't understand, I already did this. Read the thread from the beginning:
> https://lists.launchpad.net/maria-discuss/msg01338.html <= load data in empty table with index
> https://lists.launchpad.net/maria-discuss/msg01361.html <= Load the data in empty table THEN add the index (what
> you are suggesting).
> On 12/02/2014 11:03, Justin Swanhart wrote:
>> Load the data THEN add the index.  This will do the unique check once instead of on every insertion.
>> On a Bloom filter miss, select from the table and insert if it isn't there.  If this is multithreaded, use InnoDB or
>> TokuDB and SELECT ... FOR UPDATE, to prevent races.
>> Sent from my iPhone
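Justin's Bloom-filter suggestion above can be sketched as follows. This is a minimal, illustrative Python sketch, not code from the thread: the `BloomFilter` class and the dict standing in for the table lookup (the SELECT) are assumptions for the sake of a self-contained example.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def insert_unique(bloom, store, key, row):
    """On a Bloom miss the key is definitely new, so insert without a lookup.
    On a hit, fall back to a real check ('store' stands in for the SELECT)."""
    if bloom.might_contain(key) and key in store:
        return False          # duplicate, skip the insert
    bloom.add(key)
    store[key] = row
    return True
```

The point of the filter is that the expensive lookup only happens on a hit; for a mostly-unique stream, most inserts skip the lookup entirely.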
>>> On Feb 12, 2014, at 12:49 AM, Pierre <pierz@xxxxxxxxxx> wrote:
>>> Ok, I understand, but this doesn't resolve the initialization problem. I'm sure there is a bug, or something
>>> which can be improved a lot. I can't use the UNIQUE constraint when I have too many keys, which don't fit in RAM.
>>> Because even if I have memcache/a Bloom filter in front, I still need to create my initial dataset with the unique
>>> constraint, and even if my initial dataset is unique, I can't insert it into my table because of the bug
>>> explained before.
>>> If mysql can do a sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the
>>> duplicates and create the UNIQUE index.
>>> On 12/02/2014 09:24, Justin Swanhart wrote:
>>>> Hi,
>>>> This is not a bug, but how B-tree indexes work.  For them to be efficient they must fit in RAM.
>>>>    There are buffering mechanisms that can be used for secondary indexes in some cases, because the
>>>> write can be done without a read, but ONLY when the index is not unique.  If it is unique, then the
>>>> index dive is necessary, and a B-tree traversal can take multiple random IOs to get to the leaf node.
>>>> Faster IO can help, so you can look into flash storage.
>>>> Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many
>>>> implementations on the net) to reduce lookups.
>>>> --Justin
>>> On 12/02/2014 09:04, Pierre wrote:
>>>> Using this technique I have the same issue. It has now been running for several hours, I'm at 40%, and
>>>> looking at SHOW FULL PROCESSLIST, it's getting slower and slower. It will never finish.
>>>> I think there is a bug here.
>>>> First, regarding memory usage: it doesn't correctly use the buffers I set. I think it's only
>>>> using key_buffer_size; myisam_sort_buffer_size and bulk_insert_buffer_size are not used for this
>>>> task.
>>>> So what's happening? When RAM is full, mariadb does random access on the disk to sort and
>>>> filter GB of data! That's why my CPU was only a few percent used; the process was in I/O wait most
>>>> of the time.
>>>> So what I am saying here is: mariadb can't create UNIQUE keys/indexes if the rows don't fit in RAM.
>>>> However if I try to create a standard index (i.e. non-unique), it works well and is done in less
>>>> than an hour.
>>>> Query OK, 349086532 rows affected (44 min 25.21 sec)
>>>> Records: 349086532  Duplicates: 0  Warnings: 0
>>>> In this second case, Maria does a good job by using myisam_sort_buffer_size. I think
>>>> it's doing something like an external/merge sort: splitting the rows into parts that fit in RAM, sorting
>>>> them, merging them and creating the index.
>>>> It was at 100% CPU most of the time; when it wasn't, it was because mysql was loading rows into RAM
>>>> from the hard disk (and not doing random access on the disk as in CREATE UNIQUE INDEX). So why
>>>> does the UNIQUE index not behave the same way?
>>>> It's easy to reproduce the bug: just create a binary file of 2 or 3 x the size of RAM, then LOAD DATA
>>>> INFILE it and try to create a UNIQUE index on it. It will never end.
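The external/merge sort Pierre describes (sort chunks that fit in RAM, spill each sorted run to disk, merge the runs, then drop adjacent duplicates in one pass) can be sketched in Python. This is an illustration of the technique only, not MariaDB's actual implementation; the function name and chunk size are made up for the example.

```python
import heapq
import itertools
import tempfile

def external_sort_unique(values, chunk_size=1000):
    """External merge sort with deduplication: sort pieces that fit in
    'memory' (chunk_size), spill each sorted run to a temp file, then
    k-way merge the runs and collapse adjacent duplicates."""
    runs = []
    it = iter(values)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+t")   # one sorted run on disk
        f.writelines(f"{v}\n" for v in chunk)
        f.seek(0)
        runs.append(f)
    # heapq.merge streams the runs without loading them all into memory.
    merged = heapq.merge(*(map(str.rstrip, f) for f in runs))
    # After a full sort, duplicates are adjacent, so one groupby pass
    # is enough to enforce uniqueness.
    return [k for k, _ in itertools.groupby(merged)]
```

This is why the non-unique CREATE INDEX is fast (sequential reads and writes of each run) and why, in principle, the same sorted output could also detect duplicates for a UNIQUE index, which is Pierre's point.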
>>>>> On 10.02.2014 13:45, Pierre wrote:
>>>>>> Mariadb is getting slower and slower when I'm inserting a massive amount of data. I'm trying to
>>>>>> insert 166 507 066 rows (12 GB) using LOAD DATA INFILE '' into an empty table. I split my file into
>>>>>> 13 parts of the same size because it was too long to insert in one shot. Once I had inserted more
>>>>>> than 100M rows, it became extremely slow.
>>>>> The reason is the permanently updated keys.
>>>>> In general, for large inserts into a new table with
>>>>> UNIQUE KEY `p1` (`p1`),
>>>>> UNIQUE KEY `p2` (`p2`),
>>>>> * remove keys
>>>>> * insert data
>>>>> * add keys
>>>>> https://dev.mysql.com/doc/refman/5.5/en/insert-speed.html
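The remove-keys / insert-data / add-keys advice above can be demonstrated end to end. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for the MyISAM table so the example is self-contained, and the table name `t` is made up, with column and index names `p1`/`p2` taken from the quoted DDL.

```python
import sqlite3

def bulk_load_then_index(rows):
    """Load rows into an index-free table first, then build the UNIQUE
    indexes once at the end, instead of maintaining them per insert."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (p1 INTEGER, p2 INTEGER)")   # no keys yet
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)      # bulk insert
    conn.execute("CREATE UNIQUE INDEX p1 ON t (p1)")           # add keys last:
    conn.execute("CREATE UNIQUE INDEX p2 ON t (p2)")           # one sort each
    return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

count = bulk_load_then_index((i, i * 2) for i in range(100_000))
```

The index build at the end is a single sort over all rows rather than one B-tree lookup per inserted row, which is exactly the cost difference the linked insert-speed page describes.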

