
maria-discuss team mailing list archive

Re: Limited Unicode Support?

 

Hi Björn,

The time for more than 4 bytes in UTF-8 will never come. Even if emoji keep expanding until more than 1,112,064 “characters” are needed, the new encoding will not be called UTF-8 anymore, and I doubt it will even be called Unicode.

UTF-8 does not go up to 7 bytes. While the encoding scheme with leading and trailing bytes could in principle allow for more than 4 bytes, this was explicitly clarified and forbidden in RFC 3629 ( https://tools.ietf.org/html/rfc3629#section-4 ), along with the encoding of unpaired “surrogate” code points from UTF-16. So basically UTF-8 can encode everything that UTF-16 can, and nothing more.
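As a quick check of those limits, here is a minimal Python sketch (not part of the original message; the variable names are my own). It relies only on Python's built-in UTF-8 codec and counts the encodable code points, confirms that the highest one fits in four bytes, and shows that a lone surrogate is rejected.

```python
# Sketch: the limits RFC 3629 places on UTF-8, checked with Python's built-in codec.

MAX_CODE_POINT = 0x10FFFF            # highest code point reachable from UTF-16
SURROGATES = range(0xD800, 0xE000)   # reserved for UTF-16 surrogate pairs

# Code points that can legally be encoded: everything minus the surrogates.
scalar_values = (MAX_CODE_POINT + 1) - len(SURROGATES)

# The highest code point still fits in four bytes.
longest = chr(MAX_CODE_POINT).encode("utf-8")

# A lone surrogate cannot be encoded as UTF-8.
try:
    chr(0xD800).encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

print(scalar_values, len(longest), surrogate_rejected)  # 1112064 4 True
```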

The utf8mb4 story is this: IIRC, during MySQL 5.5 development there was a discussion about whether to keep using the “utf8” name or to create a new name for the Unicode (2.0+) conforming charset. As you noticed, MySQL’s traditional version of UTF-8 is a cut-down one. On the other hand, reusing a name for something different could lead to compatibility problems with existing applications. The conservative decision was to give the real (in the Unicode sense) UTF-8 a new name. The “utf8mb4” name is neither pretty nor clear, but no compatibility problems were reported.


From: Björn Keil
Sent: Friday, 11 October 2019 12:10
To: maria-discuss@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Maria-discuss] Limited Unicode Support?

Thanks for the replies. I've tried to just replace all occurrences of "utf8" in my example with "utf8mb4" and it works.

Inconveniently, this will require major conversions and downtime for my application, but at least I know what I must do to make it work.
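For illustration only, the kind of conversion mentioned above might be scripted roughly as follows. This is a sketch under assumptions, not the poster's actual procedure: the helper name `utf8mb4_statements` is hypothetical, the collation `utf8mb4_unicode_ci` is just one possible choice, and `test` is the table name from the original example. It only builds the SQL text and does not connect to a server.

```python
def utf8mb4_statements(tables, collation="utf8mb4_unicode_ci"):
    """Build SQL that switches the connection and the given tables to utf8mb4."""

    def quote(name):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    # The connection charset has to change too, not only the stored data.
    statements = ["SET NAMES utf8mb4"]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation}"
        )
    return statements


for statement in utf8mb4_statements(["test"]):
    print(statement + ";")
```

Statements like these are best tried on a copy of the data first, since converting a large table can take a while.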

However, the "mb4" sounds a little suspicious. While there are no Unicode code points numbered high enough yet to make such a measure necessary, the UTF-8 encoding allows for characters up to seven bytes long, if I am not mistaken. Does utf8mb4 allow for characters longer than four bytes if and when the time comes?

On Thu, 10 Oct 2019 at 17:18, Diego Dupin <diego.dupin@xxxxxxxxxxx> wrote:
Hi Björn,

🙋 is a character encoded in 4 bytes (0xF0 0x9F 0x99 0x8B).
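The byte sequence can be reproduced with a short Python sketch (added for illustration; it uses the code point U+1F64B given in the original question):

```python
# Sketch: encode U+1F64B and show its UTF-8 bytes.
emoji = "\U0001F64B"
encoded = emoji.encode("utf-8")

# Format each byte the same way as in the message above.
hex_bytes = " ".join(f"0x{b:02X}" for b in encoded)
print(hex_bytes)  # 0xF0 0x9F 0x99 0x8B
```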

"utf8" is a UTF-8 Unicode encoding limited to 3 bytes per character.
You have to configure the charset "utf8mb4", which provides full UTF-8 support.
https://jira.mariadb.org/browse/MDEV-8334 in 10.5 is the first step toward making utf8mb4 the default for 'utf8'.
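To find out whether existing data is affected, one could scan strings for characters that do not fit into three bytes. A minimal Python sketch, assuming the text is already available as a Python string (the helper name `chars_needing_utf8mb4` is hypothetical): every code point above U+FFFF takes four bytes in UTF-8 and therefore needs "utf8mb4".

```python
def chars_needing_utf8mb4(text):
    """Return the characters in `text` that lie above U+FFFF (4 bytes in UTF-8)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# The string from the original example, with the emoji written as an escape.
sample = "\U0001F64B Huhu. wie geht es dir?"
found = chars_needing_utf8mb4(sample)
print([f"U+{ord(ch):04X}" for ch in found])  # ['U+1F64B']
```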

regards,
diego.


On Thu, Oct 10, 2019 at 3:53 PM Björn Keil <schattenkeil@xxxxxxxxxxxxxx> wrote:
Hello,

I hope this is the proper mailing list to ask such questions; I apologise if it isn't.

I am having some problems with unusual Unicode characters in my MariaDB database.

$ mariadb --version
mariadb  Ver 15.1 Distrib 10.3.17-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
$ sudo ./mariadb.php
[sudo] Passwort für bjoern: 
Query: INSERT INTO `test` SET `string` = '🙋 Huhu. wie geht es dir?'
Inserted: '🙋 Huhu. wie geht es dir?'
Returned: '???? Huhu. wie geht es dir?'

SHOW VARIABLES LIKE 'character%':
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

As you can see here, MariaDB does not accept the character '🙋' ( https://www.fileformat.info/info/unicode/char/1f64b/index.htm ) and instead replaces it with four question marks, and I have no idea why.

I've attached the PHP code for the example.

I would be most grateful for any suggestion.

Regards,
Björn Keil

