maria-developers team mailing list archive

Thread
Date

Re: GSoC 2016:Unique indexes for blobs

To: Shubham Barai <shubhambaraiss@xxxxxxxxx>
From: Jan Lindström <jan.lindstrom@xxxxxxxxxxx>
Date: Wed, 23 Mar 2016 20:30:50 +0200
Cc: "maria-developers@xxxxxxxxxxxxxxxxxxx" <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CALxAEPtiQJPwBVP8q=n2fWSjH6yfTGVJ5pqetBe_G=4X75bK_Q@mail.gmail.com>

Hi,

All below is correct and naturally you may re-implement duplicate search on
different index type. I think there is more code that needs change as this
new index type would contain a column (hash value of the blob field) that
is not on stored on base table, right ? Indexed columns are also stored
inside a InnoDB data dictionary persistently, so that part would also need
change (dict/dict0*.cc files). Actually, you could store blob has also to
table, it could make things easier. Secondly, remember that unique keys can
be used inside a InnoDB as foreign keys, this is again a design question,
do you allow blobs to be foreign keys or not. Finally, unique key with NOT
NULL can be used as primary key i.e. clustered key on InnoDB, using blobs
hash on that might be out of reach on this timetable.

R: Jan Lindström
Principal Engineer
InnoDB

On Wed, Mar 23, 2016 at 6:38 PM, Shubham Barai <shubhambaraiss@xxxxxxxxx>
wrote:

> Hello Sergei,
>
>              I have gone through some of the source files in InnoDB.
>
>                                   / handler/ha_innodb.cc
>                                    /row/row0mysql.cc
>                                    /row/row0ins.cc
>                                    /dict/dict0mem.cc  etc.
>
> In MyISAM,
>      MI_KEYDEF and MI_UNIQUEDEF structure have most of the variables same
> except MI_KEYDEF uses some extra variables like keylength, minlength,
> maxlength .
>      As we are calculating   a hash of  the column values for long UNIQUE
> constraints ,MI_UNIQUEDEF doesn't need them.
>
> mi_write(inserts a row in MyISAM table) and mi_update(updates a row)
> call mi_unique_check to verify all unique constraints before inserting or
> updating any row.
>
> In InnoDB ,
> dict_index_t is the data structure used for an index.  I think we can use
>  the dict_index_t  structure  for our hash based index .
> It has a type-variable which represents the type of an index (DICT_CLUSTERED,
> DICT_UNIQUE,DICT_UNIVERSAL etc)
>
> For our new index, we can define a new type (let's say
> DICT_NON_UNIQUE_HASH).
>
> Insertion of a row in InnoDB  has a function call graph like this
>
> ha_innobase::write_row(uchar* record)
>
>                 [image: Inline images 1]
>
>  row_insert_for_mysql(byte* mysql_rec, row_prebuilt_t* prebuilt)
>
>                 [image: Inline images 1]
>
> row_ins_step(que_thr_t* thr)
>
>                 [image: Inline images 5]
>
> row_ins(ins_node_t* node,que_thr_t* thr)
>
>                  [image: Inline images 7]
>
> row_ins_index_entry_step(ins_node_t* node,que_thr_t* thr)
>
> [image: Inline images 7]
>
> row_ins_index_entry_set_vals(dict_index_t* then row_ins_index_entry (
> dict_index_t* index,  dtuple_t* entry, que_thr_t* thr)   index,dtuple_t*
> entry,const dtuple_t* row)
>
> [image: Inline images 1]
>
>   row_ins_clust_index_entry/
>
> row_ins_sec_index_entry
>
> [image: Inline images 3]
>
>
> row_ins_sec_index_entry_low
>
> [image: Inline images 4]
>
> row_ins_scan_sec_index_for_duplicate
>
>
>
> 1. row_ins(inserts a row to a table)  calls   row_ins_index_entry_step
> for each index defined on a table.
>
>
> 2.  row_ins_index_entry_step first calls    row_ins_index_entry_set_vals
> and then   row_ins_index_entry   .
>
>
>   row_ins_index_entry_set_vals is a function which sets the values of the dtuple
> fields  in entry  from the appropriate columns   in row.
>
> entry is the index tuple which we are going to insert into the index.
>
>
>
>
> 3.  In our hash based index ,instead of storing the actual key value in
> the index, we have to store an only hash of the column values in a key .
>
>
> So for  our new DICT_NON_UNIQUE_HASH index ,we can define a  new function
>
>   row_ins_index_entry_set_vals_hash which will calculate hash of all the
> columns values in a key and create an index entry to insert into index
>
>
> 4. row_ins_sec_index_entry_low is the function which tries to insert an
> entry into the secondary index. If the type of the index is DICT_UNIQUE,
>
> it calls row_ins_scan_sec_index_for_duplicate.
> row_ins_scan_sec_index_for_duplicate scans a unique non-clustered index
>
> to check unique constraints.
>
>
> In our case (if the type is DICT_NON_UNIQUE_HASH), we can define a new
> function row_ins_scan_sec_index_for_duplicate_hash which will scan the
> index and if there are identical hashes in the index,it will retrieve the
> rows and compare actual values.
>
>
> This is my initial approach for implementing long unique constraints in
> InnoDB.
>
> Do you think this approach will work?
>
>
> I would really appreciate your suggestions.
>
>
> Thanks,
>
> Shubham
>
> On 22 March 2016 at 20:58, Shubham Barai <shubhambaraiss@xxxxxxxxx> wrote:
>
>> Hello Sergei,
>>
>>          I understood most of the internals of long unique constraints in
>> MyISAM. I am still going through the code in InnoDB. I will soon reply to
>> you.
>>
>> Thanks,
>> Shubham
>>
>> On 21 March 2016 at 16:37, Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>>
>>> Hi, Shubham!
>>>
>>> On Mar 21, Shubham Barai wrote:
>>> >
>>> > I am currently looking into InnoDB codebase to see if it is possible
>>> > for me to extend this feature to InnoDB storage engine. As  InnoDB
>>> > doesn't support this feature internally, it would require more time
>>> > than MyISAM.  Any suggestions regarding this would be helpful.
>>>
>>> Heh, that's very good (and ambitious) :)
>>>
>>> Do you already know how MyISAM supports arbitrary long UNIQUE
>>> constraints internally? It stores only the hash of the value (of the
>>> blob, for example) in the non-unique index, and on INSERT it checks if
>>> there are identical hashes in the index. If there are (hash collision)
>>> it will retrieve the rows and compare actual blob values.
>>>
>>> It seems that InnoDB should use the same approach as MyISAM. It'll need
>>> some care for a case when two concurrent transactions insert conflicting
>>> rows - as transaction changes are not visible until commit, you won't
>>> see the conflict until it's too late. But gap locks [1] should be able
>>> to prevent that.
>>>
>>> Regards,
>>> Sergei
>>> Chief Architect MariaDB
>>> and security@xxxxxxxxxxx
>>>
>>> [1]
>>> https://dev.mysql.com/doc/refman/5.7/en/innodb-record-level-locks.html
>>>
>>
>>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp
>
>

Follow ups

Re: GSoC 2016:Unique indexes for blobs
From: Sergei Golubchik, 2016-03-23

References

GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-03
Re: GSoC 2016:Unique indexes for blobs
From: Sergei Golubchik, 2016-03-04
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-16
Re: GSoC 2016:Unique indexes for blobs
From: Sergei Golubchik, 2016-03-16
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-19
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-20
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-20
Re: GSoC 2016:Unique indexes for blobs
From: Sergei Golubchik, 2016-03-20
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-21
Re: GSoC 2016:Unique indexes for blobs
From: Sergei Golubchik, 2016-03-21
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-22
Re: GSoC 2016:Unique indexes for blobs
From: Shubham Barai, 2016-03-23