← Back to team overview

maria-discuss team mailing list archive

Re: Collations, trailing spaces and unique indexes

 

Hello Binarus, Kristian,

On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
> Binarus <lists@xxxxxxxxxx> writes:
> 
>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>> trailing spaces. “Comparison” in this context does not include the
> 
> Yes, I have always found this terminally stupid as well. But I think it
> comes from the SQL standard.

Probably this comes from here:

> 2) A <standard character set name> specifies the name of a character set that is defined by a national or
> international standard. The character repertoire of CS is defined by the standard defining the character set
> identified by that <standard character set name>. The default collation of the character set is defined by
> the order of the characters in the standard and has the PAD SPACE characteristic.

So we follow the PAD SPACE requirement for default collations here.

I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
by default. I agree with the standard here. We don't want different
behavior for CHAR and VARCHAR.


Btw, we don't follow the requirement that the default collation must be
defined according to "the order of the characters in the standard".
Default collations are traditionally case insensitive in MariaDB/MySQL.

> 
> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
> think it works much the same in most respects. But obviously some semantics
> is lost when the server no longer is aware of the character set used.

Correct.

> 
>> Since the index behaviour obviously depends on the collation, would
>> building an own collation which does not PADSPACE be an option? I have
> 
> That would be interesting, actually. I don't know what support there is for
> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?

We don't have NO PAD collations yet.
I don't remember that anybody ever asked for this before.

>From a glance, this should not be too hard to implement.

You previously wrote you need utf8.
So there are three options to make a new collation,
based on one of these existing collations:

- utf8_general_ci
- utf8_unicode_ci
- utf8_bin

but with NO PAD characteristics.

Suppose you need utf8_general_nopad_ci
(i.e. based on utf8_general_ci)


utf8_general_ci is implemented in strings/ctype-utf8.c

A new collation handler should be added, similar to this one:

static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
{
    NULL,               /* init */
    my_strnncoll_utf8_general_ci,
    my_strnncollsp_utf8_general_ci,
    my_strnxfrm_unicode,
    my_strnxfrmlen_unicode,
    my_like_range_mb,
    my_wildcmp_utf8,
    my_strcasecmp_utf8,
    my_instr_mb,
    my_hash_sort_utf8,
    my_propagate_complex
};


but these three virtual functions must be redefined to new
similar functions that do not ignore trailing spaces:

- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
- my_hash_sort_utf8              - this is used for HASH indexes
- my_strnxfrm_unicode            - this is used for filesort
                                   (non-indexed ORDER BY)

All other functions can be reused from the existing PAD SPACE collation.



Then a new "struct charset_info_st" should be defined,
similar to my_charset_utf8_general_ci.

Sounds like a few hours of work.


But then thorough testing will be needed,
which will be the most time consuming part.


Also, for consistency, it's worthy to implement
at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
and then eventually NO PAD collations for all other character sets.

> 
>  - Kristian.
> 


Follow ups

References