maria-discuss team mailing list archive

Thread
Date

Re: Collations, trailing spaces and unique indexes

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>, Binarus <lists@xxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Fri, 11 Mar 2016 18:45:54 +0400
Cc: maria-discuss@xxxxxxxxxxxxxxxxxxx
In-reply-to: <56E2A9F1.3060002@mariadb.org>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

FYI, I have added a new task for this:

https://jira.mariadb.org/browse/MDEV-9711


On 03/11/2016 03:20 PM, Alexander Barkov wrote:
> Hello Binarus, Kristian,
> 
> On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>> Binarus <lists@xxxxxxxxxx> writes:
>>
>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>> trailing spaces. “Comparison” in this context does not include the
>>
>> Yes, I have always found this terminally stupid as well. But I think it
>> comes from the SQL standard.
> 
> Probably this comes from here:
> 
>> 2) A <standard character set name> specifies the name of a character set that is defined by a national or
>> international standard. The character repertoire of CS is defined by the standard defining the character set
>> identified by that <standard character set name>. The default collation of the character set is defined by
>> the order of the characters in the standard and has the PAD SPACE characteristic.
> 
> So we follow the PAD SPACE requirement for default collations here.
> 
> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
> by default. I agree with the standard here. We don't want different
> behavior for CHAR and VARCHAR.
> 
> 
> Btw, we don't follow the requirement that the default collation must be
> defined according to "the order of the characters in the standard".
> Default collations are traditionally case insensitive in MariaDB/MySQL.
> 
>>
>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>> think it works much the same in most respects. But obviously some semantics
>> is lost when the server no longer is aware of the character set used.
> 
> Correct.
> 
>>
>>> Since the index behaviour obviously depends on the collation, would
>>> building an own collation which does not PADSPACE be an option? I have
>>
>> That would be interesting, actually. I don't know what support there is for
>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
> 
> We don't have NO PAD collations yet.
> I don't remember that anybody ever asked for this before.
> 
> From a glance, this should not be too hard to implement.
> 
> You previously wrote you need utf8.
> So there are three options to make a new collation,
> based on one of these existing collations:
> 
> - utf8_general_ci
> - utf8_unicode_ci
> - utf8_bin
> 
> but with NO PAD characteristics.
> 
> Suppose you need utf8_general_nopad_ci
> (i.e. based on utf8_general_ci)
> 
> 
> utf8_general_ci is implemented in strings/ctype-utf8.c
> 
> A new collation handler should be added, similar to this one:
> 
> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
> {
>     NULL,               /* init */
>     my_strnncoll_utf8_general_ci,
>     my_strnncollsp_utf8_general_ci,
>     my_strnxfrm_unicode,
>     my_strnxfrmlen_unicode,
>     my_like_range_mb,
>     my_wildcmp_utf8,
>     my_strcasecmp_utf8,
>     my_instr_mb,
>     my_hash_sort_utf8,
>     my_propagate_complex
> };
> 
> 
> but these three virtual functions must be redefined to new
> similar functions that do not ignore trailing spaces:
> 
> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
> - my_hash_sort_utf8              - this is used for HASH indexes
> - my_strnxfrm_unicode            - this is used for filesort
>                                    (non-indexed ORDER BY)
> 
> All other functions can be reused from the existing PAD SPACE collation.
> 
> 
> 
> Then a new "struct charset_info_st" should be defined,
> similar to my_charset_utf8_general_ci.
> 
> Sounds like a few hours of work.
> 
> 
> But then thorough testing will be needed,
> which will be the most time consuming part.
> 
> 
> Also, for consistency, it's worthy to implement
> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
> and then eventually NO PAD collations for all other character sets.
> 
>>
>>  - Kristian.
>>

Follow ups

Re: Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11

References

Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Kristian Nielsen, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Alexander Barkov, 2016-03-11