← Back to team overview

maria-discuss team mailing list archive

Re: Collations, trailing spaces and unique indexes

 

This conversation made me curious.

Which code points does a PADSPACE collation treat as trailing spaces?
Would the imagined utf8_*_nopad_* collations treat all of these code
points as significant?

Tom



On 3/11/16, 6:20 AM, "Maria-discuss on behalf of Alexander Barkov"
<maria-discuss-bounces+fsb=thefsb.org@xxxxxxxxxxxxxxxxxxx on behalf of
bar@xxxxxxxxxxx> wrote:

>Hello Binarus, Kristian,
>
>On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>> Binarus <lists@xxxxxxxxxx> writes:
>> 
>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>> trailing spaces. “Comparison” in this context does not include the
>> 
>> Yes, I have always found this terminally stupid as well. But I think it
>> comes from the SQL standard.
>
>Probably this comes from here:
>
>> 2) A <standard character set name> specifies the name of a character
>>set that is defined by a national or
>> international standard. The character repertoire of CS is defined by
>>the standard defining the character set
>> identified by that <standard character set name>. The default collation
>>of the character set is defined by
>> the order of the characters in the standard and has the PAD SPACE
>>characteristic.
>
>So we follow the PAD SPACE requirement for default collations here.
>
>I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
>by default. I agree with the standard here. We don't want different
>behavior for CHAR and VARCHAR.
>
>
>Btw, we don't follow the requirement that the default collation must be
>defined according to "the order of the characters in the standard".
>Default collations are traditionally case insensitive in MariaDB/MySQL.
>
>> 
>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>> think it works much the same in most respects. But obviously some
>>semantics
>> is lost when the server no longer is aware of the character set used.
>
>Correct.
>
>> 
>>> Since the index behaviour obviously depends on the collation, would
>>> building an own collation which does not PADSPACE be an option? I have
>> 
>> That would be interesting, actually. I don't know what support there is
>>for
>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>
>We don't have NO PAD collations yet.
>I don't remember that anybody ever asked for this before.
>
>From a glance, this should not be too hard to implement.
>
>You previously wrote you need utf8.
>So there are three options to make a new collation,
>based on one of these existing collations:
>
>- utf8_general_ci
>- utf8_unicode_ci
>- utf8_bin
>
>but with NO PAD characteristics.
>
>Suppose you need utf8_general_nopad_ci
>(i.e. based on utf8_general_ci)
>
>
>utf8_general_ci is implemented in strings/ctype-utf8.c
>
>A new collation handler should be added, similar to this one:
>
>static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
>{
>    NULL,               /* init */
>    my_strnncoll_utf8_general_ci,
>    my_strnncollsp_utf8_general_ci,
>    my_strnxfrm_unicode,
>    my_strnxfrmlen_unicode,
>    my_like_range_mb,
>    my_wildcmp_utf8,
>    my_strcasecmp_utf8,
>    my_instr_mb,
>    my_hash_sort_utf8,
>    my_propagate_complex
>};
>
>
>but these three virtual functions must be redefined to new
>similar functions that do not ignore trailing spaces:
>
>- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
>- my_hash_sort_utf8              - this is used for HASH indexes
>- my_strnxfrm_unicode            - this is used for filesort
>                                   (non-indexed ORDER BY)
>
>All other functions can be reused from the existing PAD SPACE collation.
>
>
>
>Then a new "struct charset_info_st" should be defined,
>similar to my_charset_utf8_general_ci.
>
>Sounds like a few hours of work.
>
>
>But then thorough testing will be needed,
>which will be the most time consuming part.
>
>
>Also, for consistency, it's worthy to implement
>at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>and then eventually NO PAD collations for all other character sets.
>
>> 
>>  - Kristian.
>> 
>
>_______________________________________________
>Mailing list: https://launchpad.net/~maria-discuss
>Post to     : maria-discuss@xxxxxxxxxxxxxxxxxxx
>Unsubscribe : https://launchpad.net/~maria-discuss
>More help   : https://help.launchpad.net/ListHelp




Follow ups

References