maria-discuss team mailing list archive

Re: Collations, trailing spaces and unique indexes

 

Hi Binarus,

On 03/11/2016 10:19 PM, Binarus wrote:
> On 11.03.2016 15:45, Alexander Barkov wrote:
>> FYI, I have added a new task for this:
>>
>> https://jira.mariadb.org/browse/MDEV-9711
>>
> 
> Alexander,
> 
> I couldn't resist taking a quick look into the sources.
> 
> - I have found my_hash_sort_utf8 in strings/ctype-utf8.c and am convinced that the change is incredibly easy there.
> 
> - I have found my_strnxfrm_unicode in the same file and will need more time to make my opinion of how difficult it will be
> (I don't know what a weight is, so I currently try to understand what the function does at all).

This function is used to create sort keys for non-indexed ORDER BY,
for these cases:

- ORDER BY on an expression
- ORDER BY on a column that does not have an index


The idea is exactly the same as that of the C function strxfrm.
See "man strxfrm".

The code in filesort.cc implements non-indexed ORDER BY
as follows:

1. It calls the *_strnxfrm_* functions for all records and converts
CHAR/VARCHAR/TEXT values into fixed-length binary sortable keys.

2. It then sorts these keys with a plain binary comparison.
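
The two steps above can be sketched with the standard C strxfrm(), as a
rough illustration only. This is not the filesort.cc code: KEY_LEN,
sort_entry, make_key, cmp_entries and sort_values are hypothetical names
invented for this example, and a key that does not fit is simply rejected
by an assert instead of being handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 32  /* hypothetical fixed sort-key length, in bytes */

typedef struct {
    char key[KEY_LEN];   /* fixed-length binary sortable key */
    const char *value;   /* the original string */
} sort_entry;

/* Step 1: convert one value into a fixed-length key using strxfrm(). */
static void make_key(sort_entry *e, const char *value)
{
    size_t need = strxfrm(e->key, value, KEY_LEN);

    assert(need < KEY_LEN);  /* sketch only: truncated keys are not handled */
    /* Zero the tail so that memcmp() over the whole key is well defined. */
    memset(e->key + need, 0, KEY_LEN - need);
    e->value = value;
}

/* Step 2: the comparison is purely binary, no collation logic is needed. */
static int cmp_entries(const void *a, const void *b)
{
    const sort_entry *x = (const sort_entry *)a;
    const sort_entry *y = (const sort_entry *)b;
    return memcmp(x->key, y->key, KEY_LEN);
}

/* Build keys for all n values, then sort the entries by key. */
static void sort_values(sort_entry *entries, const char **values, size_t n)
{
    for (size_t i = 0; i < n; i++)
        make_key(&entries[i], values[i]);
    qsort(entries, n, sizeof(entries[0]), cmp_entries);
}
```

Calling sort_values() on an array of strings leaves the entries ordered by
their keys. Without a call to setlocale() the program runs in the "C"
locale, where the keys compare the same way strcmp() does on the original
strings.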

By the way, fixing this function might be tricky.


Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE
character.
The NO PAD version will need to pad the tail using a weight which is
less than the weight of the smallest possible character.

This should be easy for UCA-based collations (e.g.
utf8_unicode_nopad_ci), because the smallest possible
character in UCA collations is "U+0009 HORIZONTAL TABULATION",
and its weight is 0x0201. So we can just pad the sort key
using a smaller value 0x0200.
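
A minimal sketch of that padding idea, assuming 16-bit weights stored big
endian in a fixed-size key. KEY_WEIGHTS, build_key and key_cmp are
hypothetical names, and this is not how my_strnxfrm_unicode() is actually
written. Only the values 0x0201 and 0x0200 come from the text above; any
other weight passed in is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_WEIGHTS 4                  /* hypothetical: key holds 4 weights */
#define KEY_BYTES   (KEY_WEIGHTS * 2)  /* two bytes per 16-bit weight */

#define WEIGHT_TAB   0x0201  /* weight of U+0009, as stated above */
#define WEIGHT_NOPAD 0x0200  /* proposed padding weight, below U+0009 */

/* Store n weights big endian, then fill the tail with pad_weight. */
static void build_key(unsigned char *key, const uint16_t *weights,
                      size_t n, uint16_t pad_weight)
{
    assert(n <= KEY_WEIGHTS);
    for (size_t i = 0; i < KEY_WEIGHTS; i++) {
        uint16_t w = (i < n) ? weights[i] : pad_weight;
        key[2 * i]     = (unsigned char)(w >> 8);
        key[2 * i + 1] = (unsigned char)(w & 0xFF);
    }
}

/* Keys are compared as raw bytes, as in the binary sort step. */
static int key_cmp(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, KEY_BYTES);
}
```

With WEIGHT_NOPAD as the padding value, the key of a string always compares
lower than the key of the same string followed by a tab, because the first
padding weight 0x0200 is compared against 0x0201. Padding with a weight
that some real character also uses cannot guarantee this.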


But I'm not sure yet what to do with 8-bit collations,
which usually use 0x00 as weight for the smallest character.
So we don't have a smaller value.
There are two options here:

1. Pad with 0x00. But this will mean that 'aaa<min>' and just 'aaa'
will have unpredictable order when doing ORDER BY without an index
(where <min> is the smallest possible character in the collation).

As the smallest character in non-UCA collations is usually
"U+0000 NULL", this will mean that 'aaa\0' and just 'aaa'
will have unpredictable order.

2. Reserve extra bytes at the end of the key, to store the true length, so
- 'aaa\0' will have the key '4141410004'
- 'aaa'   will have the key '4141410003', and therefore will always
   be sorted before 'aaa\0'.

I'm inclined towards #2, to have consistent ORDER BY behavior
with and without indexes.
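
Both options can be compared in a small sketch. It assumes a 4-character
column and a toy weight function that maps ASCII letters to upper case, as
a stand-in for the weight table of a case-insensitive 8-bit collation.
MAX_CHARS, weight_of, key_pad_zero and key_with_length are hypothetical
names, and the single trailing length byte is only one possible layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHARS    4                /* hypothetical column length */
#define KEY_PADDED   MAX_CHARS        /* option 1: weights only */
#define KEY_WITH_LEN (MAX_CHARS + 1)  /* option 2: weights + length byte */

/* Toy weight: ASCII upper case, everything else maps to itself. */
static unsigned char weight_of(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* Option 1: write the weights, then pad the tail with 0x00. */
static void key_pad_zero(unsigned char *key, const unsigned char *s,
                         size_t len)
{
    assert(len <= MAX_CHARS);
    for (size_t i = 0; i < len; i++)
        key[i] = weight_of(s[i]);
    memset(key + len, 0x00, MAX_CHARS - len);
}

/* Option 2: same as option 1, plus the true length in one extra byte. */
static void key_with_length(unsigned char *key, const unsigned char *s,
                            size_t len)
{
    key_pad_zero(key, s, len);
    key[MAX_CHARS] = (unsigned char)len;
}
```

With key_pad_zero(), 'aaa' and 'aaa\0' produce the same four bytes, so a
binary sort cannot tell them apart. With key_with_length() the keys are
41 41 41 00 03 and 41 41 41 00 04, matching the example above, and 'aaa'
always sorts first.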

> 
> - My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (neither in the same file nor in any other). Where is it?


The function name is just "my_strnncollsp_utf8".


> 
> Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?

Can you please clarify which lines you mean?

> 
> Should we continue our conversation on the developer mailing list?

Sure.

> 
> Regards,
> 
> Binarus
> 
> 

