maria-discuss team mailing list archive
Message #03458
Re: Collations, trailing spaces and unique indexes
Hi Binarus,
On 03/11/2016 10:19 PM, Binarus wrote:
> On 11.03.2016 15:45, Alexander Barkov wrote:
>> FYI, I have added a new task for this:
>>
>> https://jira.mariadb.org/browse/MDEV-9711
>>
>
> Alexander,
>
> I couldn't resist taking a quick look into the sources.
>
> - I have found my_hash_sort_utf8 in strings/ctype-utf8.c and am convinced that the change is incredibly easy there.
>
> - I have found my_strnxfrm_unicode in the same file and will need more time to form an opinion on how difficult it will be
> (I don't know what a weight is, so I am currently trying to understand what the function does at all).
This function is used to create sort keys for non-indexed ORDER BY,
for these cases:
- ORDER BY on an expression
- ORDER BY on a column that does not have an index
The idea is exactly the same as that of the C function strxfrm.
See "man strxfrm".
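As a rough illustration of the strxfrm idea in standard C: each string is converted into a key once, and the keys can then be compared with plain strcmp. The helper names make_sort_key and compare_via_keys are made up for this sketch and are not MariaDB functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'ed sort key for s under the
   current LC_COLLATE locale. The caller frees it. NULL on failure. */
char *make_sort_key(const char *s)
{
  size_t need;
  char *key;
  assert(s != NULL);
  /* With a zero size, strxfrm only reports the required key length. */
  need = strxfrm(NULL, s, 0) + 1;
  key = malloc(need);
  if (key != NULL)
    strxfrm(key, s, need);
  return key;
}

/* Compare two strings through their keys: strcmp() on the keys orders
   them the same way strcoll() orders the original strings. */
int compare_via_keys(const char *a, const char *b)
{
  char *ka = make_sort_key(a);
  char *kb = make_sort_key(b);
  int res = (ka != NULL && kb != NULL) ? strcmp(ka, kb) : 0;
  free(ka);
  free(kb);
  return res;
}
```

The point of the transformation is that the expensive collation work is done once per string, after which every comparison is a cheap byte comparison.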
The code in filesort.cc implements non-indexed ORDER BY as follows:
1. It calls the *_strnxfrm_* functions for all records and converts
CHAR/VARCHAR/TEXT values into fixed-length, binary-sortable keys.
2. It then performs a binary sort on these keys.
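The two steps above can be sketched in a few lines of C. This is only a toy model, not the filesort.cc code: KEY_LEN, toy_make_key and toy_sort_keys are hypothetical names, and the "weight" of a character is simply assumed to be its upper-case ASCII code. The tail is filled with the weight of SPACE, which is what the current PAD SPACE behaviour amounts to.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 8   /* hypothetical fixed key length in bytes */

/* Step 1: turn a value into a fixed-length key. The toy weight of a
   character is its upper-case ASCII code; the tail is padded with the
   weight of SPACE (0x20). */
void toy_make_key(unsigned char *key, const char *src, size_t src_len)
{
  size_t i;
  assert(src_len <= KEY_LEN);
  for (i = 0; i < src_len; i++)
    key[i] = (unsigned char) toupper((unsigned char) src[i]);
  memset(key + i, 0x20, KEY_LEN - i);
}

/* Bytewise comparison of two keys, for use with qsort(). */
static int key_cmp(const void *a, const void *b)
{
  return memcmp(a, b, KEY_LEN);
}

/* Step 2: sort the keys with a plain binary comparison. */
void toy_sort_keys(unsigned char (*keys)[KEY_LEN], size_t n)
{
  qsort(keys, n, KEY_LEN, key_cmp);
}
```

In this model 'a' and 'A ' produce identical keys, which is exactly the trailing-space behaviour this thread is about.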
By the way, fixing this function might be tricky.
Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE
character.
The NO PAD version will need to pad the tail using a weight which is
less than the weight of the smallest possible character.
This should be easy for UCA-based collations (e.g.
utf8_unicode_nopad_ci), because the smallest possible
character in UCA collations is "U+0009 HORIZONTAL TABULATION",
and its weight is 0x0201. So we can just pad the sort key
using a smaller value 0x0200.
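A small sketch of why a fill value below the minimum weight works, assuming 16-bit weights stored big-endian in the key. The values 0x0201 and 0x0200 are the ones mentioned above; the helper names and the weight used for 'a' in the usage note are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UCA_MIN_WEIGHT 0x0201  /* weight of U+0009, as stated above */
#define NOPAD_FILL     0x0200  /* proposed fill value, below the minimum */

/* Hypothetical helper: write nweights 16-bit weights big-endian into
   key, then fill the rest of the key (key_len bytes, even) with fill. */
void toy_uca_key(unsigned char *key, size_t key_len,
                 const uint16_t *weights, size_t nweights, uint16_t fill)
{
  size_t i, pos = 0;
  assert(key_len % 2 == 0 && nweights * 2 <= key_len);
  for (i = 0; i < nweights; i++)
  {
    key[pos++] = (unsigned char) (weights[i] >> 8);
    key[pos++] = (unsigned char) (weights[i] & 0xFF);
  }
  while (pos < key_len)
  {
    key[pos++] = (unsigned char) (fill >> 8);
    key[pos++] = (unsigned char) (fill & 0xFF);
  }
}

/* Non-zero if key a sorts strictly before key b in a binary comparison. */
int toy_key_less(const unsigned char *a, const unsigned char *b, size_t len)
{
  return memcmp(a, b, len) < 0;
}
```

With an illustrative weight of 0x0E33 for 'a', the key for 'a' becomes 0E33 0200 0200 0200 and the key for 'a' followed by TAB becomes 0E33 0201 0200 0200, so the shorter string always sorts first.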
But I'm not sure yet what to do with 8-bit collations,
which usually use 0x00 as weight for the smallest character.
So we don't have a smaller value.
There are two options here:
1. Pad with 0x00. But this will mean that 'aaa<min>' and just 'aaa'
will have unpredictable order when doing ORDER BY without an index
(where <min> is the smallest possible character in the collation).
As the smallest character in non-UCA collations is usually
"U+0000 NULL", this will mean that 'aaa\0' and just 'aaa'
will have unpredictable order.
2. Reserve extra bytes at the end of the key, to store the true length, so
- 'aaa\0' will have the key '4141410004'
- 'aaa' will have the key '4141410003', and therefore will always
be sorted before 'aaa\0'.
I'm inclined towards #2, to have consistent ORDER BY behavior
with and without indexes.
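The difference between the two options can be shown with a toy 8-bit collation. This is a sketch under assumptions, not the real implementation: CHARS, toy_weight, key_pad_zero and key_with_length are hypothetical names, and the weight table only maps lower-case ASCII letters to upper case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHARS 4   /* hypothetical number of weight bytes per key */

/* Toy weight: lower-case ASCII letters map to upper case, every other
   byte keeps its own value, so '\0' has the smallest weight 0x00. */
static unsigned char toy_weight(unsigned char c)
{
  return (c >= 'a' && c <= 'z') ? (unsigned char) (c - 'a' + 'A') : c;
}

/* Option 1: pad the tail with 0x00. key must hold CHARS bytes. */
void key_pad_zero(unsigned char *key, const char *src, size_t len)
{
  size_t i;
  assert(len <= CHARS);
  for (i = 0; i < len; i++)
    key[i] = toy_weight((unsigned char) src[i]);
  memset(key + len, 0x00, CHARS - len);
}

/* Option 2: the same key plus one trailing byte that stores the true
   length. key must hold CHARS + 1 bytes. */
void key_with_length(unsigned char *key, const char *src, size_t len)
{
  key_pad_zero(key, src, len);
  key[CHARS] = (unsigned char) len;
}
```

With option 1, 'aaa' and 'aaa\0' both produce 41414100, so a binary sort cannot tell them apart. With option 2 the keys are 4141410003 and 4141410004, matching the example above, and 'aaa' always sorts first.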
>
> - My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (neither in the same file nor in any other). Where is it?
The function name is just "my_strnncollsp_utf8".
>
> Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?
Can you please clarify which lines you mean?
>
> Should we continue our conversation on the developer mailing list?
Sure.
>
> Regards,
>
> Binarus
References
-
Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11
-
Re: Collations, trailing spaces and unique indexes
From: Kristian Nielsen, 2016-03-11
-
Re: Collations, trailing spaces and unique indexes
From: Alexander Barkov, 2016-03-11
-
Re: Collations, trailing spaces and unique indexes
From: Alexander Barkov, 2016-03-11
-
Re: Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11