maria-developers team mailing list archive

A few character set incompatible changes in 10.2

 

Hello Ian,

In 10.2 we made a few changes in the scope of these bug reports:

MDEV-9874 LOAD XML INFILE does not handle well broken multi-byte characters
MDEV-9823 LOAD DATA INFILE silently truncates incomplete byte sequences
MDEV-9842 LOAD DATA INFILE does not work well with a TEXT column when using sjis
MDEV-9811 LOAD DATA INFILE does not work well with gbk in some cases
MDEV-9824 LOAD DATA does not work with multi-byte strings in LINES TERMINATED BY when IGNORE is specified

The idea is that LOAD DATA INFILE and LOAD XML INFILE now behave more consistently with INSERT/UPDATE and store as much data as possible.


When a broken byte sequence is found, LOAD DATA now replaces the broken bytes with question marks and keeps loading the value. Older versions
truncated the value at the leftmost broken byte.


So suppose I create a file with these bytes:

SELECT CONCAT('aaa',0xF09F988E,'bbb') INTO OUTFILE '/tmp/test.txt';

where 0xF09F988E is the utf8mb4 encoding of the character "U+1F60E SMILING FACE WITH SUNGLASSES",

and now erroneously load it as 3-byte utf8:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8);
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET utf8;
SHOW WARNINGS;
SELECT * FROM t1;

(notice CHARACTER SET utf8 instead of CHARACTER SET utf8mb4 in LOAD).


In 5.5 the above script would return:

+---------+------+-------------------------------------------------------------------------+
| Level   | Code | Message                                                                 |
+---------+------+-------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x98\x8Ebb...' for column 'a' at row 1 |
+---------+------+-------------------------------------------------------------------------+
+------+
| a    |
+------+
| aaa  |
+------+


In 10.2 it returns the same warning, but loads more data:
+------------+
| a          |
+------------+
| aaa????bbb |
+------------+
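
The difference between the two results can be modelled outside the server. The following is only a minimal Python sketch, not MariaDB's actual code: the function names valid_seq_len and load_utf8mb3 are made up for illustration, and Python's own UTF-8 decoder stands in for the server's 3-byte utf8 validation. It assumes that every byte which does not start a well-formed 1..3-byte sequence counts as one broken byte.

```python
def valid_seq_len(raw: bytes, i: int) -> int:
    """Length of a well-formed 1..3-byte UTF-8 sequence at raw[i], or 0 if broken."""
    b0 = raw[i]
    if b0 < 0x80:
        return 1                      # plain ASCII
    if 0xC2 <= b0 <= 0xDF:
        n = 2                         # lead byte of a 2-byte sequence
    elif 0xE0 <= b0 <= 0xEF:
        n = 3                         # lead byte of a 3-byte sequence
    else:
        return 0                      # stray continuation byte, or a 4-byte lead such as 0xF0
    try:
        raw[i:i + n].decode("utf-8")  # rejects bad or missing continuation bytes
    except UnicodeDecodeError:
        return 0
    return n


def load_utf8mb3(raw: bytes, replace_broken: bool) -> str:
    """Toy model of loading raw bytes into a 3-byte utf8 column.

    replace_broken=True  models the 10.2 behavior (one '?' per broken byte).
    replace_broken=False models the older behavior (truncate at the first broken byte).
    """
    out = []
    i = 0
    while i < len(raw):
        n = valid_seq_len(raw, i)
        if n:
            out.append(raw[i:i + n].decode("utf-8"))
            i += n
        elif replace_broken:
            out.append("?")
            i += 1
        else:
            break
    return "".join(out)


# Same bytes as written to /tmp/test.txt above (without the line terminator).
data = b"aaa" + bytes.fromhex("F09F988E") + b"bbb"
print(load_utf8mb3(data, replace_broken=False))  # aaa
print(load_utf8mb3(data, replace_broken=True))   # aaa????bbb
```

In this model, each of the four bytes of the utf8mb4 sequence is rejected on its own, which is why four question marks appear rather than one.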


Valerii suggests that we document these changes more precisely and mention that they are actually incompatible changes!



The sad thing is that 10.0.26 has yet another, different behavior:
MDEV-11217 Regression: LOAD DATA INFILE started to fail with an error
We're currently thinking about what to do with that.