← Back to team overview

maria-discuss team mailing list archive

Re: Collations, trailing spaces and unique indexes

 

Hi Tom,


On 03/18/2016 07:30 PM, Tom Worster wrote:
> This conversation made me curious.
> 
> Which code points does a PADSPACE collation treat as trailing spaces?

Usually, just U+0020 SPACE.

But this actually depends on an exact collation.


This one treats U+00A0 NO-BREAK SPACE as equal to U+0020:

http://collation-charts.org/mysql60/mysql604.cp1250_general_ci.html

and thus ignores both trailing regular spaces and no-break spaces.


> Would the imagined utf8_*_nopad_* collations treat all of these code
> points as significant?

Yes, this is the idea.


> 
> Tom
> 
> 
> 
> On 3/11/16, 6:20 AM, "Maria-discuss on behalf of Alexander Barkov"
> <maria-discuss-bounces+fsb=thefsb.org@xxxxxxxxxxxxxxxxxxx on behalf of
> bar@xxxxxxxxxxx> wrote:
> 
>> Hello Binarus, Kristian,
>>
>> On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>>> Binarus <lists@xxxxxxxxxx> writes:
>>>
>>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>>> trailing spaces. “Comparison” in this context does not include the
>>>
>>> Yes, I have always found this terminally stupid as well. But I think it
>>> comes from the SQL standard.
>>
>> Probably this comes from here:
>>
>>> 2) A <standard character set name> specifies the name of a character
>>> set that is defined by a national or
>>> international standard. The character repertoire of CS is defined by
>>> the standard defining the character set
>>> identified by that <standard character set name>. The default collation
>>> of the character set is defined by
>>> the order of the characters in the standard and has the PAD SPACE
>>> characteristic.
>>
>> So we follow the PAD SPACE requirement for default collations here.
>>
>> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
>> by default. I agree with the standard here. We don't want different
>> behavior for CHAR and VARCHAR.
>>
>>
>> Btw, we don't follow the requirement that the default collation must be
>> defined according to "the order of the characters in the standard".
>> Default collations are traditionally case insensitive in MariaDB/MySQL.
>>
>>>
>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>>> think it works much the same in most respects. But obviously some
>>> semantics
>>> is lost when the server no longer is aware of the character set used.
>>
>> Correct.
>>
>>>
>>>> Since the index behaviour obviously depends on the collation, would
>>>> building an own collation which does not PADSPACE be an option? I have
>>>
>>> That would be interesting, actually. I don't know what support there is
>>> for
>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>>
>> We don't have NO PAD collations yet.
>> I don't remember that anybody ever asked for this before.
>>
>>From a glance, this should not be too hard to implement.
>>
>> You previously wrote you need utf8.
>> So there are three options to make a new collation,
>> based on one of these existing collations:
>>
>> - utf8_general_ci
>> - utf8_unicode_ci
>> - utf8_bin
>>
>> but with NO PAD characteristics.
>>
>> Suppose you need utf8_general_nopad_ci
>> (i.e. based on utf8_general_ci)
>>
>>
>> utf8_general_ci is implemented in strings/ctype-utf8.c
>>
>> A new collation handler should be added, similar to this one:
>>
>> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
>> {
>>    NULL,               /* init */
>>    my_strnncoll_utf8_general_ci,
>>    my_strnncollsp_utf8_general_ci,
>>    my_strnxfrm_unicode,
>>    my_strnxfrmlen_unicode,
>>    my_like_range_mb,
>>    my_wildcmp_utf8,
>>    my_strcasecmp_utf8,
>>    my_instr_mb,
>>    my_hash_sort_utf8,
>>    my_propagate_complex
>> };
>>
>>
>> but these three virtual functions must be redefined to new
>> similar functions that do not ignore trailing spaces:
>>
>> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
>> - my_hash_sort_utf8              - this is used for HASH indexes
>> - my_strnxfrm_unicode            - this is used for filesort
>>                                   (non-indexed ORDER BY)
>>
>> All other functions can be reused from the existing PAD SPACE collation.
>>
>>
>>
>> Then a new "struct charset_info_st" should be defined,
>> similar to my_charset_utf8_general_ci.
>>
>> Sounds like a few hours of work.
>>
>>
>> But then thorough testing will be needed,
>> which will be the most time consuming part.
>>
>>
>> Also, for consistency, it's worthy to implement
>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>> and then eventually NO PAD collations for all other character sets.
>>
>>>
>>>  - Kristian.
>>>
>>
>> _______________________________________________
>> Mailing list: https://launchpad.net/~maria-discuss
>> Post to     : maria-discuss@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~maria-discuss
>> More help   : https://help.launchpad.net/ListHelp
> 
> 


References