← Back to team overview

maria-discuss team mailing list archive

Re: Collations, trailing spaces and unique indexes

 

Hi Binarus,

On 03/11/2016 09:25 PM, Binarus wrote:
> On 11.03.2016 15:45, Alexander Barkov wrote:
>> FYI, I have added a new task for this:
>>
>> https://jira.mariadb.org/browse/MDEV-9711
>>
> 
> Alexander, not sure if my question makes any sense, but what about the _cs collations? Like latin1_general_cs, latin2_czech_cs? Don't we like to have these, too?

We do, eventually.

I just wanted to split the task into these steps:

1. Default and _bin collations.
   This is what MDEV-9711 is about.

2. Unicode Collation Algorithm (UCA) based collations
   (e.g. utf8_unicode_nopad_ci)


3. The rest (not covered by the previous steps)
   This will include latin1_general_nopad_cs, latin2_czech_nopad_cs


This is why asked in the previous letter if utf8_unicode_nopad_ci would
work for you.
Also, which MariaDB version do you need this for? Will 10.1 work?

> 
> Regards,
> 
> Binarus
> 
> 
>>
>> On 03/11/2016 03:20 PM, Alexander Barkov wrote:
>>> Hello Binarus, Kristian,
>>>
>>> On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>>>> Binarus <lists@xxxxxxxxxx> writes:
>>>>
>>>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>>>> trailing spaces. “Comparison” in this context does not include the
>>>>
>>>> Yes, I have always found this terminally stupid as well. But I think it
>>>> comes from the SQL standard.
>>>
>>> Probably this comes from here:
>>>
>>>> 2) A <standard character set name> specifies the name of a character set that is defined by a national or
>>>> international standard. The character repertoire of CS is defined by the standard defining the character set
>>>> identified by that <standard character set name>. The default collation of the character set is defined by
>>>> the order of the characters in the standard and has the PAD SPACE characteristic.
>>>
>>> So we follow the PAD SPACE requirement for default collations here.
>>>
>>> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
>>> by default. I agree with the standard here. We don't want different
>>> behavior for CHAR and VARCHAR.
>>>
>>>
>>> Btw, we don't follow the requirement that the default collation must be
>>> defined according to "the order of the characters in the standard".
>>> Default collations are traditionally case insensitive in MariaDB/MySQL.
>>>
>>>>
>>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>>>> think it works much the same in most respects. But obviously some semantics
>>>> is lost when the server no longer is aware of the character set used.
>>>
>>> Correct.
>>>
>>>>
>>>>> Since the index behaviour obviously depends on the collation, would
>>>>> building an own collation which does not PADSPACE be an option? I have
>>>>
>>>> That would be interesting, actually. I don't know what support there is for
>>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>>>
>>> We don't have NO PAD collations yet.
>>> I don't remember that anybody ever asked for this before.
>>>
>>> From a glance, this should not be too hard to implement.
>>>
>>> You previously wrote you need utf8.
>>> So there are three options to make a new collation,
>>> based on one of these existing collations:
>>>
>>> - utf8_general_ci
>>> - utf8_unicode_ci
>>> - utf8_bin
>>>
>>> but with NO PAD characteristics.
>>>
>>> Suppose you need utf8_general_nopad_ci
>>> (i.e. based on utf8_general_ci)
>>>
>>>
>>> utf8_general_ci is implemented in strings/ctype-utf8.c
>>>
>>> A new collation handler should be added, similar to this one:
>>>
>>> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
>>> {
>>>     NULL,               /* init */
>>>     my_strnncoll_utf8_general_ci,
>>>     my_strnncollsp_utf8_general_ci,
>>>     my_strnxfrm_unicode,
>>>     my_strnxfrmlen_unicode,
>>>     my_like_range_mb,
>>>     my_wildcmp_utf8,
>>>     my_strcasecmp_utf8,
>>>     my_instr_mb,
>>>     my_hash_sort_utf8,
>>>     my_propagate_complex
>>> };
>>>
>>>
>>> but these three virtual functions must be redefined to new
>>> similar functions that do not ignore trailing spaces:
>>>
>>> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
>>> - my_hash_sort_utf8              - this is used for HASH indexes
>>> - my_strnxfrm_unicode            - this is used for filesort
>>>                                    (non-indexed ORDER BY)
>>>
>>> All other functions can be reused from the existing PAD SPACE collation.
>>>
>>>
>>>
>>> Then a new "struct charset_info_st" should be defined,
>>> similar to my_charset_utf8_general_ci.
>>>
>>> Sounds like a few hours of work.
>>>
>>>
>>> But then thorough testing will be needed,
>>> which will be the most time consuming part.
>>>
>>>
>>> Also, for consistency, it's worthy to implement
>>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>>> and then eventually NO PAD collations for all other character sets.
>>>
>>>>
>>>>  - Kristian.
>>>>
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-discuss
> Post to     : maria-discuss@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-discuss
> More help   : https://help.launchpad.net/ListHelp
> 


Follow ups

References