maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #10128
Re: MDEV-11343 LOAD DATA INFILE fails to load data with an escape character followed by a multi-byte character
Hi, Alexander!
On Nov 25, Alexander Barkov wrote:
> commit 3b2701d33630857908d236686ad3df18b5ffddc6
> Author: Alexander Barkov <bar@xxxxxxxxxxx>
> Date: Fri Nov 25 17:16:30 2016 +0400
>
> MDEV-11343 LOAD DATA INFILE fails to load data with an escape character followed by a multi-byte character
>
> Partially backporting MDEV-9874 from 10.2 to 10.0
>
> READ_INFO::read_field() raised the ER_INVALID_CHARACTER_STRING error
> when reading an escape character followed by a multi-byte character.
>
> Raising wellformedness errors in READ_INFO::read_field() was wrong,
> because the main goal of READ_INFO::read_field() is to *unescape* the
> data which was presumably escaped using mysql_real_escape_string(),
> using the same character set with the one specified in
> "LOAD DATA INFILE ... CHARACTER SET ..." (or assumed by default).
>
> During LOAD DATA, multi-byte characters are not always scanned as a single
> entity! In case of escaped data, parts of a multi-byte character can be
> scanned on different loop iterations. So the old code erroneously tested
> welformedness in the middle of a multi-byte character.
>
> Moreover, the data after unescaping can go into a BLOB field, not a text field.
> Wellformedness tests are meaningless in this case.
>
> Ater this patch, wellformedness is only checked later, during
> Field::store(str,length,cs) time. The loop that scans bytes only
> makes sure to revert the changes made by mysql_real_escape_string().
>
> Note, in some cases users can supply data which did not really go through
> mysql_real_escape_string() and was escaped by some other means,
> or was not escaped at all. The file reported in this MDEV contains
> the string "\ä", which is an example of such improperly escaped data, as
> - either there should be two backslashes: "\\ä"
> - or there should be no backslashes at all: "ä"
> mysql_real_escape_string() could not generate "\ä".
good comment, thanks
> diff --git a/sql/sql_load.cc b/sql/sql_load.cc
> index af4b251..6a473ce 100644
> --- a/sql/sql_load.cc
> +++ b/sql/sql_load.cc
> @@ -79,6 +79,81 @@ class READ_INFO {
> NET *io_net;
> int level; /* for load xml */
>
> +
> +#if MYSQL_VERSION_ID >= 100200
> +#error This 10.0 and 10.1 specific fix should be removed in 10.2.
> +#error Fix read_mbtail() to use my_charlen() instead of my_charlen_tmp()
I don't quite undrestand that.
Should the compete patch be removed when merging with 10.2? Or only
my_charlen_tmp be replaced with my_charlen ?
10.2 already has read_mbtail() that uses my_charlen(), so it seems as if
the whole patch should be null-merged into 10.2.
But then, why didn't you put the whole new read_mbtail() under #else?
> +#else
> + int my_charlen_tmp(CHARSET_INFO *cs, const char *str, const char *end)
> + {
> + my_wc_t wc;
> + return cs->cset->mb_wc(cs, &wc, (const uchar *) str, (const uchar *) end);
> + }
> +#endif
> +
> + /**
> + Read a tail of a multi-byte character.
> + The first byte of the character is assumed to be already
> + read from the file and appended to "str".
> +
> + @returns true - if EOF happened unexpectedly
> + @returns false - no EOF happened: found a good multi-byte character,
> + or a bad byte sequence
> +
> + Note:
> + The return value depends only on EOF:
> + - read_mbtail() returns "false" is a good character was read, but also
> + - read_mbtail() returns "false" if an incomplete byte sequence was found
> + and no EOF happened.
> +
> + For example, suppose we have an ujis file with bytes 0x8FA10A, where:
> + - 0x8FA1 is an incomplete prefix of a 3-byte character
> + (it should be [8F][A1-FE][A1-FE] to make a full 3-byte character)
> + - 0x0A is a line demiliter
> + This file has some broken data, the trailing [A1-FE] is missing.
> +
> + In this example it works as follows:
> + - 0x8F is read from the file and put into "data" before the call
> + for read_mbtail()
> + - 0xA1 is read from the file and put into "data" by read_mbtail()
> + - 0x0A is kept in the read queue, so the next read iteration after
> + the current read_mbtail() call will normally find it and recognize as
> + a line delimiter
> + - the current call for read_mbtail() returns "false",
> + because no EOF happened
> + */
> + bool read_mbtail(String *str)
> + {
> + int chlen;
> + if ((chlen= my_charlen_tmp(read_charset, str->end() - 1, str->end())) == 1)
> + return false; // Single byte character found
> + for (uint32 length0= str->length() - 1 ; MY_CS_IS_TOOSMALL(chlen); )
> + {
> + int chr= GET;
> + if (chr == my_b_EOF)
> + {
> + DBUG_PRINT("info", ("read_mbtail: chlen=%d; unexpected EOF", chlen));
> + return true; // EOF
> + }
> + str->append(chr);
> + chlen= my_charlen_tmp(read_charset, str->ptr() + length0, str->end());
> + if (chlen == MY_CS_ILSEQ)
> + {
> + /**
> + It has been an incomplete (but a valid) sequence so far,
> + but the last byte turned it into a bad byte sequence.
> + Unget the very last byte.
> + */
> + str->length(str->length() - 1);
> + PUSH(chr);
> + DBUG_PRINT("info", ("read_mbtail: ILSEQ"));
> + return false; // Bad byte sequence
> + }
> + }
> + DBUG_PRINT("info", ("read_mbtail: chlen=%d", chlen));
> + return false; // Good multi-byte character
> + }
> +
> public:
> bool error,line_cuted,found_null,enclosed;
> uchar *row_start, /* Found row starts here */
> @@ -1510,7 +1585,8 @@ int READ_INFO::read_field()
>
> for (;;)
> {
> - while ( to < end_of_buff)
> + // Make sure we have enough space for the longest multi-byte character.
> + while ( to + read_charset->mbmaxlen < end_of_buff)
> {
> chr = GET;
> if (chr == my_b_EOF)
> @@ -1598,52 +1674,27 @@ int READ_INFO::read_field()
> }
> }
> #ifdef USE_MB
> - uint ml= my_mbcharlen(read_charset, chr);
> - if (ml == 0)
> - {
> - *to= '\0';
> - my_error(ER_INVALID_CHARACTER_STRING, MYF(0),
> - read_charset->csname, buffer);
> - error= true;
> - return 1;
> - }
> -
> - if (ml > 1 &&
> - to + ml <= end_of_buff)
> - {
> - uchar* p= to;
> - *to++ = chr;
> -
> - for (uint i= 1; i < ml; i++)
> - {
> - chr= GET;
> - if (chr == my_b_EOF)
> - {
> - /*
> - Need to back up the bytes already ready from illformed
> - multi-byte char
> - */
> - to-= i;
> - goto found_eof;
> - }
> - *to++ = chr;
> - }
> - if (my_ismbchar(read_charset,
> - (const char *)p,
> - (const char *)to))
> - continue;
> - for (uint i= 0; i < ml; i++)
> - PUSH(*--to);
> - chr= GET;
> - }
> - else if (ml > 1)
> - {
> - // Buffer is too small, exit while loop, and reallocate.
> - PUSH(chr);
> - break;
> - }
> #endif
> *to++ = (uchar) chr;
> +#if MYSQL_VERSION_ID >= 100200
> +#error This 10.0 and 10.1 specific fix should be removed in 10.2
> +#else
> + if (my_mbcharlen(read_charset, (uchar) chr) > 1)
> + {
> + /*
> + A known MBHEAD found. Try to scan the full multi-byte character.
> + Otherwise, a possible following second byte 0x5C would be
> + mis-interpreted as an escape on the next iteration.
> + (Important for big5, gbk, sjis, cp932).
> + */
> + String tmp((char *) to - 1, read_charset->mbmaxlen, read_charset);
> + tmp.length(1);
> + bool eof= read_mbtail(&tmp);
> + to+= tmp.length() - 1;
> + if (eof)
> + goto found_eof;
> + }
> +#endif
> }
> /*
> ** We come here if buffer is too small. Enlarge it and continue
Regards,
Sergei
Chief Architect MariaDB
and security@xxxxxxxxxxx
Follow ups
References