maria-developers team mailing list archive

Thread
Date
Re: [Maria-discuss] Collations, trailing spaces and unique indexes

To: Binarus <lists@xxxxxxxxxx>, maria-developers@xxxxxxxxxxxxxxxxxxx
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Thu, 17 Mar 2016 08:46:09 +0400
In-reply-to: <56E9756D.8010300@binarus.de>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
Hi Binarus,


On 03/16/2016 07:02 PM, Binarus wrote:
> Alexander,
> 
> I'd like to continue our conversation from Maria-discuss.

Sure.

> 
>>> - I have found my_strnxfrm_unicode in the same file and will need more time to make my opinion of how difficult it will be
>>> (I don't know what a weight is, so I currently try to understand what the function does at all).
>>
>> This function is used to create sort keys for non-indexed ORDER BY,
>> for these cases:
>>
>> - ORDER BY on an expression
>> - ORDER BY on a column that does not have an index
>>
>>
>> The idea is exactly the same with the C function strxfrm.
>> See "man strxfrm".
>>
>> The code implements non-indexed ORDER BY in filesort.cc
>> as follows:
>>
>> 1. It calls *_strnxfrm_* functions for all records and converts
>> CHAR/VARCHAR/TEXT values into their fixed length binary sortable keys.
>>
>> 2 . Then executes binary sorting on these keys.
>>
>> By the way, fixing this function might be tricky.
>>
>>
>> Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE
>> character.
>> The NO PAD version will need to pad the tail using a weight which is
>> less than the weight of the smallest possible character.
>>
>> This should be easy for UCA bases collations (e.g.
>> utf8_unicode_nopad_ci), because the smallest possible
>> character in UCA collations is "U+0009 HORIZONTAL TABULATION",
>> and its weight is 0x0201. So we can just pad the sort key
>> using a smaller value 0x0200.
>>
>>
>> But I'm not sure yet what to do with 8-bit collations,
>> which usually use 0x00 as weight for the smallest character.
>> So we don't have a smaller value.
>> There are two options here:
>>
>> 1. Pad with 0x00. But this will mean that 'aaa<min>' and just 'aaa'
>> will  have unpredictable order when doing ORDER BY without an index
>> (where <min> is the smallest possible character in the collation).
>>
>> As the smallest character in non-UCA collations is usually
>> "U+0000 NULL", this will mean that 'aaa\0' and just 'aaa'
>> will have unpredictable order.
>>
>> 2. Reserve extra bytes at the end of the key, to store the true length, so
>> - 'aaa\0' will have the key '4141410004'
>> - 'aaa'   will have the key '4141410003', and therefore will always
>>    be sorted before 'aaa\0'.
>>
>> I'm inclined towards #2, to have consistent ORDER BY behavior
>> with and without indexes.
> 
> Wow, thanks for explaining. I now have learned that I probably can't assist you in solving that problem (I haven't been in unicode, weights, sorting and character sets very deep until now).
> 
> But if it helps, you eventually could break the problem into two steps:
> 
> 1) Fix the problem with unique indexes
> 
> 2) Fix ordering for any possible situation
> 
> The problems we have encountered in real life so far are all related to 1), i.e. we would be happy if unique indexes would consider 'a' and 'a ' being different values. In contrast, ordering is not that important for us. I don't know what others are thinking about that, though.
> 

Thanks for the explanation. That makes understand your needs better.

When we add new collations into the main tree, I prefer to have
both steps already addressed, to have consistent collation behavior.
But you're right that in a development tree it can be done in two
separate steps, for simplicity.


>>>
>>> - My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (nor in the same neither in any other file). Where is it?
>>
>> The function name is just "my_strnncollsp_utf8".
> 
> OK, thank you very much. I must have misunderstood something.
> 
>>> Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?
>>
>> Can you please clarify which lines do you mean?
> 
> I am referring to the file ctype-utf8.c, line 5698. There is:
> 
> #ifndef VARCHAR_WITH_DIFF_ENDSPACE_ARE_DIFFERENT_FOR_UNIQUE
>   diff_if_only_endspace_difference= 0;
> #endif
> 
> I haven't been able yet to thoroughly analyze where and how this is used, but at least somebody obviously already had some thoughts whether or not trailing spaces should make a difference in UNIQUEs :-)


Thanks. This is a kind of dead code. You correctly guessed that it was
supposed to switch between NO PAD and PAD SPACE, but the exact behavior
was supposed to be chosen and compile time. Then we understood that
we want NO PAD and PAD SPACE collations to co-exist, so this approach
with conditional compilation appeared not to be a way to go.

So this code has never been used in the NO PAD mode actually.
We should refactor it when working on NO PAD collations,
by removing #ifndef and #else and moving the relevant pieces into
different functions, for PAD SPACE and NO PAD handlers.


> 
> Regards,
> 
> Binarus
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp
>
References

Re: [Maria-discuss] Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-16