maria-discuss team mailing list archive
-
maria-discuss team
-
Mailing list archive
-
Message #03434
Re: Collations, trailing spaces and unique indexes
FYI, I have added a new task for this:
https://jira.mariadb.org/browse/MDEV-9711
On 03/11/2016 03:20 PM, Alexander Barkov wrote:
> Hello Binarus, Kristian,
>
> On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>> Binarus <lists@xxxxxxxxxx> writes:
>>
>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>> trailing spaces. “Comparison” in this context does not include the
>>
>> Yes, I have always found this terminally stupid as well. But I think it
>> comes from the SQL standard.
>
> Probably this comes from here:
>
>> 2) A <standard character set name> specifies the name of a character set that is defined by a national or
>> international standard. The character repertoire of CS is defined by the standard defining the character set
>> identified by that <standard character set name>. The default collation of the character set is defined by
>> the order of the characters in the standard and has the PAD SPACE characteristic.
>
> So we follow the PAD SPACE requirement for default collations here.
>
> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
> by default. I agree with the standard here. We don't want different
> behavior for CHAR and VARCHAR.
>
>
> Btw, we don't follow the requirement that the default collation must be
> defined according to "the order of the characters in the standard".
> Default collations are traditionally case insensitive in MariaDB/MySQL.
>
>>
>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>> think it works much the same in most respects. But obviously some semantics
>> is lost when the server no longer is aware of the character set used.
>
> Correct.
>
>>
>>> Since the index behaviour obviously depends on the collation, would
>>> building an own collation which does not PADSPACE be an option? I have
>>
>> That would be interesting, actually. I don't know what support there is for
>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>
> We don't have NO PAD collations yet.
> I don't remember that anybody ever asked for this before.
>
> From a glance, this should not be too hard to implement.
>
> You previously wrote you need utf8.
> So there are three options to make a new collation,
> based on one of these existing collations:
>
> - utf8_general_ci
> - utf8_unicode_ci
> - utf8_bin
>
> but with NO PAD characteristics.
>
> Suppose you need utf8_general_nopad_ci
> (i.e. based on utf8_general_ci)
>
>
> utf8_general_ci is implemented in strings/ctype-utf8.c
>
> A new collation handler should be added, similar to this one:
>
> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
> {
> NULL, /* init */
> my_strnncoll_utf8_general_ci,
> my_strnncollsp_utf8_general_ci,
> my_strnxfrm_unicode,
> my_strnxfrmlen_unicode,
> my_like_range_mb,
> my_wildcmp_utf8,
> my_strcasecmp_utf8,
> my_instr_mb,
> my_hash_sort_utf8,
> my_propagate_complex
> };
>
>
> but these three virtual functions must be redefined to new
> similar functions that do not ignore trailing spaces:
>
> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
> - my_hash_sort_utf8 - this is used for HASH indexes
> - my_strnxfrm_unicode - this is used for filesort
> (non-indexed ORDER BY)
>
> All other functions can be reused from the existing PAD SPACE collation.
>
>
>
> Then a new "struct charset_info_st" should be defined,
> similar to my_charset_utf8_general_ci.
>
> Sounds like a few hours of work.
>
>
> But then thorough testing will be needed,
> which will be the most time consuming part.
>
>
> Also, for consistency, it's worthy to implement
> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
> and then eventually NO PAD collations for all other character sets.
>
>>
>> - Kristian.
>>
Follow ups
References