← Back to team overview

maria-discuss team mailing list archive

Re: Collations, trailing spaces and unique indexes

 

On 11.03.2016 15:45, Alexander Barkov wrote:
> FYI, I have added a new task for this:
> 
> https://jira.mariadb.org/browse/MDEV-9711
> 

Alexander, not sure if my question makes any sense, but what about the _cs collations? Like latin1_general_cs, latin2_czech_cs? Don't we like to have these, too?

Regards,

Binarus


> 
> On 03/11/2016 03:20 PM, Alexander Barkov wrote:
>> Hello Binarus, Kristian,
>>
>> On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>>> Binarus <lists@xxxxxxxxxx> writes:
>>>
>>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>>> trailing spaces. “Comparison” in this context does not include the
>>>
>>> Yes, I have always found this terminally stupid as well. But I think it
>>> comes from the SQL standard.
>>
>> Probably this comes from here:
>>
>>> 2) A <standard character set name> specifies the name of a character set that is defined by a national or
>>> international standard. The character repertoire of CS is defined by the standard defining the character set
>>> identified by that <standard character set name>. The default collation of the character set is defined by
>>> the order of the characters in the standard and has the PAD SPACE characteristic.
>>
>> So we follow the PAD SPACE requirement for default collations here.
>>
>> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
>> by default. I agree with the standard here. We don't want different
>> behavior for CHAR and VARCHAR.
>>
>>
>> Btw, we don't follow the requirement that the default collation must be
>> defined according to "the order of the characters in the standard".
>> Default collations are traditionally case insensitive in MariaDB/MySQL.
>>
>>>
>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>>> think it works much the same in most respects. But obviously some semantics
>>> is lost when the server no longer is aware of the character set used.
>>
>> Correct.
>>
>>>
>>>> Since the index behaviour obviously depends on the collation, would
>>>> building an own collation which does not PADSPACE be an option? I have
>>>
>>> That would be interesting, actually. I don't know what support there is for
>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>>
>> We don't have NO PAD collations yet.
>> I don't remember that anybody ever asked for this before.
>>
>> From a glance, this should not be too hard to implement.
>>
>> You previously wrote you need utf8.
>> So there are three options to make a new collation,
>> based on one of these existing collations:
>>
>> - utf8_general_ci
>> - utf8_unicode_ci
>> - utf8_bin
>>
>> but with NO PAD characteristics.
>>
>> Suppose you need utf8_general_nopad_ci
>> (i.e. based on utf8_general_ci)
>>
>>
>> utf8_general_ci is implemented in strings/ctype-utf8.c
>>
>> A new collation handler should be added, similar to this one:
>>
>> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
>> {
>>     NULL,               /* init */
>>     my_strnncoll_utf8_general_ci,
>>     my_strnncollsp_utf8_general_ci,
>>     my_strnxfrm_unicode,
>>     my_strnxfrmlen_unicode,
>>     my_like_range_mb,
>>     my_wildcmp_utf8,
>>     my_strcasecmp_utf8,
>>     my_instr_mb,
>>     my_hash_sort_utf8,
>>     my_propagate_complex
>> };
>>
>>
>> but these three virtual functions must be redefined to new
>> similar functions that do not ignore trailing spaces:
>>
>> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
>> - my_hash_sort_utf8              - this is used for HASH indexes
>> - my_strnxfrm_unicode            - this is used for filesort
>>                                    (non-indexed ORDER BY)
>>
>> All other functions can be reused from the existing PAD SPACE collation.
>>
>>
>>
>> Then a new "struct charset_info_st" should be defined,
>> similar to my_charset_utf8_general_ci.
>>
>> Sounds like a few hours of work.
>>
>>
>> But then thorough testing will be needed,
>> which will be the most time consuming part.
>>
>>
>> Also, for consistency, it's worthy to implement
>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>> and then eventually NO PAD collations for all other character sets.
>>
>>>
>>>  - Kristian.
>>>


Follow ups

References