maria-developers team mailing list archive

Thread
Date

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion

To: Alexander Barkov <bar@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Thu, 12 Mar 2015 17:47:56 +0100
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <54F9B5A3.2080709@mariadb.org>
User-agent: Mutt/1.5.23 (2014-03-12)

Hi, Alexander!

See below.

My main concern is the stability of the charset API.
I'd like it to be moved into a service, as plugins surely need it
(not "will need", various plugins need it even now). But you change it
quite often, so it would make a rather volatile service :(

> diff --git a/sql/sql_string.cc b/sql/sql_string.cc
> index 9fb462e..a0b6395 100644
> --- a/sql/sql_string.cc
> +++ b/sql/sql_string.cc
> @@ -922,8 +922,8 @@ String *copy_if_not_alloced(String *to,String *from,uint32 from_length)
>        my_charset_same(from_cs, to_cs))
>    {
>      m_cannot_convert_error_pos= NULL;
> -    return to_cs->cset->copy_abort(to_cs, to, to_length, from, from_length,
> -                                   nchars, this);
> +    return to_cs->cset->copy_fix(to_cs, to, to_length, from, from_length,
> +                                 nchars, this);

Now you don't use copy_abort() anywhere.
and you've added copy_abort with the comment "A preparatory patch for MDEV-6566"

I believe you need to remove it now, may be revert the whole
commit b1b6101a

>    }
>    else
>    {
> diff --git a/include/m_ctype.h b/include/m_ctype.h
> index f08efb4..4fa8779 100644
> --- a/include/m_ctype.h
> +++ b/include/m_ctype.h
> @@ -444,7 +444,55 @@ struct my_charset_handler_st
>    size_t        (*scan)(CHARSET_INFO *, const char *b, const char *e,
>                          int sq);
>  
> -  /* Copying routines */
> +  /* String copying routines and helpers for them */
> +  /*
> +    charlen() - calculate length of the left-most character in bytes.
> +    @param  cs    Character set
> +    @param  str   The beginning of the string
> +    @param  end   The end of the string
> +    
> +    @return       MY_CS_ILSEQ if a bad byte sequence was found.
> +    @return       MY_CS_TOOSMALLN(x) if the string ended unexpectedly.
> +    @return       a positive number in the range 1..mbmaxlen,
> +                  if a valid character was found.
> +  */
> +  int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);

Why couldn't you use mbcharlen for that?
I mean, why not to add validity checks to mbcharlen?

or mb_wc

> +  /*
> +    well_formed_char_length() - returns character length of a string.
> +    
> +    @param cs          Character set
> +    @param str         The beginning of the string
> +    @param end         The end of the string
> +    @param nchars      Not more than "nchars" left-most characters are checked.
> +    @param status[OUT] Additional statistics is returned here.
> +                       "status" can be uninitialized before the call,
> +                       and it is fully initialized after the call.
> +    
> +    status->m_source_end_pos is set to the position where reading stopped.
> +    
> +    If a bad byte sequence is found, the function returns immediately and
> +    status->m_well_formed_error_pos is set to the position where a bad byte
> +    sequence was found.
> +    
> +    status->m_well_formed_error_pos is set to NULL if no bad bytes were found.
> +    If status->m_well_formed_error_pos is NULL after the call, that means:
> +    - either the function reached the end of the string,
> +    - or all "nchars" characters were read.
> +    The caller can check status->m_source_end_pos to detect which of these two
> +    happened.
> +  */
> +  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
> +                                    const char *str, const char *end,
> +                                    size_t nchars,
> +                                    MY_STRCOPY_STATUS *status);

What's the difference with numchars?

> +
> +  /*
> +    copy_fix() - copy a string like copy_abort(), but fix bad bytes to '?'.
> +  */
> +  size_t  (*copy_fix)(CHARSET_INFO *,
> +                      char *dst, size_t dst_length,
> +                      const char *src, size_t src_length,
> +                      size_t nchars, MY_STRCOPY_STATUS *status);
>    /*
>      copy_abort() - copy a string, abort if a bad byte sequence was found.
>      Not more than "nchars" characters are copied.
> diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c
> index b010c52..35ae191 100644
> --- a/strings/ctype-simple.c
> +++ b/strings/ctype-simple.c
> @@ -1108,6 +1115,19 @@ size_t my_well_formed_len_8bit(CHARSET_INFO *cs __attribute__((unused)),
>  }
>  
>  
> +size_t
> +my_well_formed_char_length_8bit(CHARSET_INFO *cs __attribute__((unused)),

Try to put the type and the function name on one line.
As I've noticed yesterday, git diff doesn't recognizes it as a function
name otherwise

> +                                const char *start, const char *end,
> +                                size_t nchars, MY_STRCOPY_STATUS *status)
> +{
> +  size_t nbytes= (size_t) (end - start);
> +  size_t res= MY_MIN(nbytes, nchars);
> +  status->m_well_formed_error_pos= NULL;
> +  status->m_source_end_pos= start + res;
> +  return res;
> +}

Regards,
Sergei

Follow ups

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-12

References

MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-06