maria-developers team mailing list archive

Thread
Date
Re: MDEV-11343 LOAD DATA INFILE fails to load data with an escape character followed by a multi-byte character

To: Alexander Barkov <bar@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Mon, 28 Nov 2016 14:47:53 +0100
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <58383A0B.3020902@mariadb.org>
User-agent: Mutt/1.5.24 (2015-08-30)
Hi, Alexander!

On Nov 25, Alexander Barkov wrote:
> commit 3b2701d33630857908d236686ad3df18b5ffddc6
> Author: Alexander Barkov <bar@xxxxxxxxxxx>
> Date:   Fri Nov 25 17:16:30 2016 +0400
> 
>     MDEV-11343 LOAD DATA INFILE fails to load data with an escape character followed by a multi-byte character
>     
>     Partially backporting MDEV-9874 from 10.2 to 10.0
>     
>     READ_INFO::read_field() raised the ER_INVALID_CHARACTER_STRING error
>     when reading an escape character followed by a multi-byte character.
>     
>     Raising wellformedness errors in READ_INFO::read_field() was wrong,
>     because the main goal of READ_INFO::read_field() is to *unescape* the
>     data which was presumably escaped using mysql_real_escape_string(),
>     using the same character set with the one specified in
>     "LOAD DATA INFILE ... CHARACTER SET ..." (or assumed by default).
>     
>     During LOAD DATA, multi-byte characters are not always scanned as a single
>     entity! In case of escaped data, parts of a multi-byte character can be
>     scanned on different loop iterations. So the old code erroneously tested
>     welformedness in the middle of a multi-byte character.
>     
>     Moreover, the data after unescaping can go into a BLOB field, not a text field.
>     Wellformedness tests are meaningless in this case.
>     
>     Ater this patch, wellformedness is only checked later, during
>     Field::store(str,length,cs) time. The loop that scans bytes only
>     makes sure to revert the changes made by mysql_real_escape_string().
>     
>     Note, in some cases users can supply data which did not really go through
>     mysql_real_escape_string() and was escaped by some other means,
>     or was not escaped at all. The file reported in this MDEV contains
>     the string "\ä", which is an example of such improperly escaped data, as
>     - either there should be two backslashes:   "\\ä"
>     - or there should be no backslashes at all: "ä"
>     mysql_real_escape_string() could not generate "\ä".

good comment, thanks

> diff --git a/sql/sql_load.cc b/sql/sql_load.cc
> index af4b251..6a473ce 100644
> --- a/sql/sql_load.cc
> +++ b/sql/sql_load.cc
> @@ -79,6 +79,81 @@ class READ_INFO {
>    NET *io_net;
>    int level; /* for load xml */
>  
> +
> +#if MYSQL_VERSION_ID >= 100200
> +#error This 10.0 and 10.1 specific fix should be removed in 10.2.
> +#error Fix read_mbtail() to use my_charlen() instead of my_charlen_tmp()

I don't quite undrestand that.
Should the compete patch be removed when merging with 10.2? Or only
my_charlen_tmp be replaced with my_charlen ?

10.2 already has read_mbtail() that uses my_charlen(), so it seems as if
the whole patch should be null-merged into 10.2.
But then, why didn't you put the whole new read_mbtail() under #else? 

> +#else
> +  int my_charlen_tmp(CHARSET_INFO *cs, const char *str, const char *end)
> +  {
> +    my_wc_t wc;
> +    return cs->cset->mb_wc(cs, &wc, (const uchar *) str, (const uchar *) end);
> +  }
> +#endif
> +
> +  /**
> +    Read a tail of a multi-byte character.
> +    The first byte of the character is assumed to be already
> +    read from the file and appended to "str".
> +
> +    @returns  true  - if EOF happened unexpectedly
> +    @returns  false - no EOF happened: found a good multi-byte character,
> +                                       or a bad byte sequence
> +
> +    Note:
> +    The return value depends only on EOF:
> +    - read_mbtail() returns "false" is a good character was read, but also
> +    - read_mbtail() returns "false" if an incomplete byte sequence was found
> +      and no EOF happened.
> +
> +    For example, suppose we have an ujis file with bytes 0x8FA10A, where:
> +    - 0x8FA1 is an incomplete prefix of a 3-byte character
> +      (it should be [8F][A1-FE][A1-FE] to make a full 3-byte character)
> +    - 0x0A is a line demiliter
> +    This file has some broken data, the trailing [A1-FE] is missing.
> +
> +    In this example it works as follows:
> +    - 0x8F is read from the file and put into "data" before the call
> +      for read_mbtail()
> +    - 0xA1 is read from the file and put into "data" by read_mbtail()
> +    - 0x0A is kept in the read queue, so the next read iteration after
> +      the current read_mbtail() call will normally find it and recognize as
> +      a line delimiter
> +    - the current call for read_mbtail() returns "false",
> +      because no EOF happened
> +  */
> +  bool read_mbtail(String *str)
> +  {
> +    int chlen;
> +    if ((chlen= my_charlen_tmp(read_charset, str->end() - 1, str->end())) == 1)
> +      return false; // Single byte character found
> +    for (uint32 length0= str->length() - 1 ; MY_CS_IS_TOOSMALL(chlen); )
> +    {
> +      int chr= GET;
> +      if (chr == my_b_EOF)
> +      {
> +        DBUG_PRINT("info", ("read_mbtail: chlen=%d; unexpected EOF", chlen));
> +        return true; // EOF
> +      }
> +      str->append(chr);
> +      chlen= my_charlen_tmp(read_charset, str->ptr() + length0, str->end());
> +      if (chlen == MY_CS_ILSEQ)
> +      {
> +        /**
> +          It has been an incomplete (but a valid) sequence so far,
> +          but the last byte turned it into a bad byte sequence.
> +          Unget the very last byte.
> +        */
> +        str->length(str->length() - 1);
> +        PUSH(chr);
> +        DBUG_PRINT("info", ("read_mbtail: ILSEQ"));
> +        return false; // Bad byte sequence
> +      }
> +    }
> +    DBUG_PRINT("info", ("read_mbtail: chlen=%d", chlen));
> +    return false; // Good multi-byte character
> +  }
> +
>  public:
>    bool error,line_cuted,found_null,enclosed;
>    uchar	*row_start,			/* Found row starts here */
> @@ -1510,7 +1585,8 @@ int READ_INFO::read_field()
>  
>    for (;;)
>    {
> -    while ( to < end_of_buff)
> +    // Make sure we have enough space for the longest multi-byte character.
> +    while ( to + read_charset->mbmaxlen < end_of_buff)
>      {
>        chr = GET;
>        if (chr == my_b_EOF)
> @@ -1598,52 +1674,27 @@ int READ_INFO::read_field()
>  	}
>        }
>  #ifdef USE_MB
> -        uint ml= my_mbcharlen(read_charset, chr);
> -        if (ml == 0)
> -        {
> -          *to= '\0';
> -          my_error(ER_INVALID_CHARACTER_STRING, MYF(0),
> -                   read_charset->csname, buffer);
> -          error= true;
> -          return 1;
> -        }
> -
> -        if (ml > 1 &&
> -            to + ml <= end_of_buff)
> -        {
> -          uchar* p= to;
> -          *to++ = chr;
> -
> -          for (uint i= 1; i < ml; i++)
> -          {
> -            chr= GET;
> -            if (chr == my_b_EOF)
> -            {
> -              /*
> -                Need to back up the bytes already ready from illformed
> -                multi-byte char 
> -              */
> -              to-= i;
> -              goto found_eof;
> -            }
> -            *to++ = chr;
> -          }
> -          if (my_ismbchar(read_charset,
> -                        (const char *)p,
> -                        (const char *)to))
> -            continue;
> -          for (uint i= 0; i < ml; i++)
> -            PUSH(*--to);
> -          chr= GET;
> -        }
> -        else if (ml > 1)
> -        {
> -          // Buffer is too small, exit while loop, and reallocate.
> -          PUSH(chr);
> -          break;
> -        }
>  #endif
>        *to++ = (uchar) chr;
> +#if MYSQL_VERSION_ID >= 100200
> +#error This 10.0 and 10.1 specific fix should be removed in 10.2
> +#else
> +      if (my_mbcharlen(read_charset, (uchar) chr) > 1)
> +      {
> +        /*
> +          A known MBHEAD found. Try to scan the full multi-byte character.
> +          Otherwise, a possible following second byte 0x5C would be
> +          mis-interpreted as an escape on the next iteration.
> +          (Important for big5, gbk, sjis, cp932).
> +        */
> +        String tmp((char *) to - 1, read_charset->mbmaxlen, read_charset);
> +        tmp.length(1);
> +        bool eof= read_mbtail(&tmp);
> +        to+= tmp.length() - 1;
> +        if (eof)
> +          goto found_eof;
> +      }
> +#endif
>      }
>      /*
>      ** We come here if buffer is too small. Enlarge it and continue

Regards,
Sergei
Chief Architect MariaDB
and security@xxxxxxxxxxx
Follow ups

Re: MDEV-11343 LOAD DATA INFILE fails to load data with an escape character followed by a multi-byte character
From: Alexander Barkov, 2016-11-28
References

MDEV-11343 LOAD DATA INFILE fails to load data with an escape character followed by a multi-byte character
From: Alexander Barkov, 2016-11-25