maria-discuss team mailing list archive

Thread
Date
Re: Collations, trailing spaces and unique indexes

To: Binarus <lists@xxxxxxxxxx>, maria-discuss@xxxxxxxxxxxxxxxxxxx
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Wed, 16 Mar 2016 09:23:40 +0400
In-reply-to: <56E2EAB4.5090700@binarus.de>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
  Hi Binarus,

Thanks for a detailed answer.
It seems that NO PAD collations are going to be a very welcome feature.

Please see my replies and questions below.


On 03/11/2016 07:56 PM, Binarus wrote:
> Hello Kristian, hello Alexander,
> 
> On 11.03.2016 12:20, Alexander Barkov wrote:
>> So we follow the PAD SPACE requirement for default collations here.
> 
> I think it's reasonable that the default collations are implemented according to the standard.
> Nevertheless, the standard enforces an exceptionally stupid policy here (IMHO),

Using PAD SPACE by default makes it safe to ALTER TABLE and switch:

- from VARCHAR(N) to CHAR(M)
- from CHAR(N)    to VARCHAR(M)

where M>=N.

I think that was the main reason for make PAD SPACE by default.
So I agree with the standard here.


> and therefore I think we are in urgent need of additional (non-default) collations which solve that problem (and what you have written below makes me hope ...).
> 
>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>>> think it works much the same in most respects. But obviously some semantics
>>> is lost when the server no longer is aware of the character set used.
>>
>> Correct.
>  
> I'll go that way, sticking with MySQL, if the problem won't be solved by new collations in MariaDB.
> 
>>>
>>>> Since the index behaviour obviously depends on the collation, would
>>>> building an own collation which does not PADSPACE be an option? I have
>>>
>>> That would be interesting, actually. I don't know what support there is for
>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>>
>> We don't have NO PAD collations yet.
> 
> Thanks for making this clear; I haven't found something about it in the documents. By the way, the document sections which deal with the char data types could eventually mention these pitfalls (like the MySQL documentation) :-)

Sure, we'll mention this clearer when we add NO PAD collations.

> 
>> I don't remember that anybody ever asked for this before.
> 
> I wonder why, given the dozens (or hundreds?) of threads and tutorials around the web which try to explain how to circumvent these issues (none of them provides a general solution besides switching to ...BINARY).
>  
>> [...]
>> Sounds like a few hours of work.
> 
> This is very good news. I will now try to give some arguments why this really should be done.
> 
> 1) Avoid programming mistakes
> 
> People might be programming for decades, but when not being database experts and thrown into a web development project, even those experienced programmers won't come to the idea that a database which returns trailing spaces in the result set of SELECT statements throws away these trailing spaces when building the index or when comparing (e.g. for evaluation the conditions of the WHERE CLAUSE). This is completely illogical and dangerous and makes debugging substantially more difficult.
> 
> 2) Developers (and not DBMSs) should be in control of what happens to trailing spaces

Exactly.
The DBMS should supply a full variety of useful collations,
so the users have enough flexibility to choose the one that suites best.
So I like the idea of new NO PAD collations very much.

> 
> A DBMS must not silently change data, nor for comparing nor for other purposes. It is solely the decision of the application logic if trailing spaces are acceptable or desired in data values (possibly entered by users), but not the decision of the DBMS. Nothing else than the application logic has to decide what to do with trailing spaces (trim them, keep them, reject user input, return error messages etc.), and the DBMS has to store, retrieve, compare and index the results of the application logic as-is. Period.
> 
> However, there is no argument against additional and optional functions or configuration flags of the DBMS which allow easy trimming or other (automatic and silent) transformation of the values and all sorts of weird compare modes, if these are optional. But in the first place, there must also be a mode / column type for every sort of data, notably for strings, in which data is stored, compared and indexed as-is.
> 
> We already have the binary data types which in this respect behave like desired, but all string functions fail with binary data values or behave surprisingly / impractically, so nearly all string functions (which are very important in many applications) are lost for binary data type columns.
> 
> 3) There are use cases for string data with trailing spaces
> 
> I have seen some examples of trailing spaces, some of them useless, but some quite meaningful. In every case, the application logic relied on a string with trailing space being not equal to the same string without the trailing space. In some cases, the developers had not thought about the issues the trailing spaces could cause, i.e. their programs were faulty; in the other cases, they have put a lot of effort and used the weirdest tricks and index combinations to circumvent the problems. I can't remember a case where they have switched to ...BINARY data types.
> 
> Two of the funniest cases (just because it's Friday - skip the next two sections if you don't want to laugh):
> 
> Some application actually stored an internal status bit at the end of existing user input, probably because the programmers were in a hurry or had no administrative access to the MySQL server, i.e. they couldn't create a new status column in the respective table. Now how could you save a status bit within an existing string without making it visible to the users? The answer is of course: Append a space at the end of the string to set the status, remove it to clear the status. Of course, the respective column had to be UNIQUE in the sense that two identical user inputs should be allowed if the "status bit" differed ...
> 
> Some other application generated session keys in string form and used a part of the ASCII character set for the keys (e.g. a-z, A-Z, 0-9, special chars like $% and so on, including the space). Of course, as chance would have it, there could be a trailing space at the end of the session key. I can't remember if this led to problems in this case, but theoretically, it surely could have.
> 
> Unfortunately, my own use case is too difficult to explain here.

Thanks for use examples. I liked the one about "status bit" very much :)

> 
> To summarize, there are all sorts of serious pitfalls which might be very difficult to debug and very difficult to foresee if the DBMS ignores trailing spaces when storing, retrieving, comparing or indexing strings, so I would be very grateful, and I would immediately switch to MariaDB, if appropriate collations would be added there.
>  
>> But then thorough testing will be needed,
>> which will be the most time consuming part.
> 
> Probably, but I have a feeling that *not* including trailing spaces in comparisons / collations is much more difficult than including them.
> A trailing space is just a char as every other char, isn't it? So I suppose the structure of the respective functions for _nopad_ collations will be much less complicated than the structure of the existing functions.

Yes. I think so. The idea is just to disable the code that ignores
trailing spaces.


> Given that, the testing eventually wouldn't consume too much time ...

We'll need to cover thoroughly many thing, including but not limited to:

- DISTINCT
- UNION
- ORDER BY with indexes
- ORDER BY without indexes
- GROUP BY
- Mixing NO PAD and PAD collations
- Function output
- GROUP_CONCAT(DISTINCT)


>  
>> Also, for consistency, it's worthy to implement
>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>> and then eventually NO PAD collations for all other character sets.
> 
> Personally, I would be totally happy if I had the three new utf8 collations (I probably will never use anything else than utf8 for strings).

We're interested in implementing NO PAD counterparts for the default
and the _bin collations under terms of Google Summer Of Code 2016
(GSoC). By this will take a few months. I guess you need this more
urgently.

Question: would utf8_unicode_nopad_ci work for you for now?
We could implement it separately, without having to wait until
the end of GSoC.

> 
> Thank you very much!
> 
> Binarus
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-discuss
> Post to     : maria-discuss@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-discuss
> More help   : https://help.launchpad.net/ListHelp
>
References

Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Kristian Nielsen, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Alexander Barkov, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11