maria-discuss team mailing list archive

Thread
Date

Re: Collations, trailing spaces and unique indexes

To: maria-discuss@xxxxxxxxxxxxxxxxxxx
From: Binarus <lists@xxxxxxxxxx>
Date: Fri, 11 Mar 2016 16:56:36 +0100
In-reply-to: <56E2A9F1.3060002@mariadb.org>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

Hello Kristian, hello Alexander,

On 11.03.2016 12:20, Alexander Barkov wrote:
> So we follow the PAD SPACE requirement for default collations here.

I think it's reasonable that the default collations are implemented according to the standard. Nevertheless, the standard enforces an exceptionally stupid policy here (IMHO), and therefore I think we are in urgent need of additional (non-default) collations which solve that problem (and what you have written below makes me hope ...).

>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>> think it works much the same in most respects. But obviously some semantics
>> is lost when the server no longer is aware of the character set used.
> 
> Correct.

I'll go that way, sticking with MySQL, if the problem won't be solved by new collations in MariaDB.

>>
>>> Since the index behaviour obviously depends on the collation, would
>>> building an own collation which does not PADSPACE be an option? I have
>>
>> That would be interesting, actually. I don't know what support there is for
>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
> 
> We don't have NO PAD collations yet.

Thanks for making this clear; I haven't found something about it in the documents. By the way, the document sections which deal with the char data types could eventually mention these pitfalls (like the MySQL documentation) :-)

> I don't remember that anybody ever asked for this before.

I wonder why, given the dozens (or hundreds?) of threads and tutorials around the web which try to explain how to circumvent these issues (none of them provides a general solution besides switching to ...BINARY).

> [...]
> Sounds like a few hours of work.

This is very good news. I will now try to give some arguments why this really should be done.

1) Avoid programming mistakes

People might be programming for decades, but when not being database experts and thrown into a web development project, even those experienced programmers won't come to the idea that a database which returns trailing spaces in the result set of SELECT statements throws away these trailing spaces when building the index or when comparing (e.g. for evaluation the conditions of the WHERE CLAUSE). This is completely illogical and dangerous and makes debugging substantially more difficult.

2) Developers (and not DBMSs) should be in control of what happens to trailing spaces

A DBMS must not silently change data, nor for comparing nor for other purposes. It is solely the decision of the application logic if trailing spaces are acceptable or desired in data values (possibly entered by users), but not the decision of the DBMS. Nothing else than the application logic has to decide what to do with trailing spaces (trim them, keep them, reject user input, return error messages etc.), and the DBMS has to store, retrieve, compare and index the results of the application logic as-is. Period.

However, there is no argument against additional and optional functions or configuration flags of the DBMS which allow easy trimming or other (automatic and silent) transformation of the values and all sorts of weird compare modes, if these are optional. But in the first place, there must also be a mode / column type for every sort of data, notably for strings, in which data is stored, compared and indexed as-is.

We already have the binary data types which in this respect behave like desired, but all string functions fail with binary data values or behave surprisingly / impractically, so nearly all string functions (which are very important in many applications) are lost for binary data type columns.

3) There are use cases for string data with trailing spaces

I have seen some examples of trailing spaces, some of them useless, but some quite meaningful. In every case, the application logic relied on a string with trailing space being not equal to the same string without the trailing space. In some cases, the developers had not thought about the issues the trailing spaces could cause, i.e. their programs were faulty; in the other cases, they have put a lot of effort and used the weirdest tricks and index combinations to circumvent the problems. I can't remember a case where they have switched to ...BINARY data types.

Two of the funniest cases (just because it's Friday - skip the next two sections if you don't want to laugh):

Some application actually stored an internal status bit at the end of existing user input, probably because the programmers were in a hurry or had no administrative access to the MySQL server, i.e. they couldn't create a new status column in the respective table. Now how could you save a status bit within an existing string without making it visible to the users? The answer is of course: Append a space at the end of the string to set the status, remove it to clear the status. Of course, the respective column had to be UNIQUE in the sense that two identical user inputs should be allowed if the "status bit" differed ...

Some other application generated session keys in string form and used a part of the ASCII character set for the keys (e.g. a-z, A-Z, 0-9, special chars like $% and so on, including the space). Of course, as chance would have it, there could be a trailing space at the end of the session key. I can't remember if this led to problems in this case, but theoretically, it surely could have.

Unfortunately, my own use case is too difficult to explain here.

To summarize, there are all sorts of serious pitfalls which might be very difficult to debug and very difficult to foresee if the DBMS ignores trailing spaces when storing, retrieving, comparing or indexing strings, so I would be very grateful, and I would immediately switch to MariaDB, if appropriate collations would be added there.

> But then thorough testing will be needed,
> which will be the most time consuming part.

Probably, but I have a feeling that *not* including trailing spaces in comparisons / collations is much more difficult than including them. A trailing space is just a char as every other char, isn't it? So I suppose the structure of the respective functions for _nopad_ collations will be much less complicated than the structure of the existing functions. Given that, the testing eventually wouldn't consume too much time ...

> Also, for consistency, it's worthy to implement
> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
> and then eventually NO PAD collations for all other character sets.

Personally, I would be totally happy if I had the three new utf8 collations (I probably will never use anything else than utf8 for strings).

Thank you very much!

Binarus

Follow ups

Re: Collations, trailing spaces and unique indexes
From: Ian Gilfillan, 2016-03-16
Re: Collations, trailing spaces and unique indexes
From: Alexander Barkov, 2016-03-16

References

Collations, trailing spaces and unique indexes
From: Binarus, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Kristian Nielsen, 2016-03-11
Re: Collations, trailing spaces and unique indexes
From: Alexander Barkov, 2016-03-11