← Back to team overview

maria-developers team mailing list archive

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion

 

Hi, Alexander!

See below.

My main concern is the stability of the charset API.
I'd like it to be moved into a service, as plugins surely need it
(not "will need", various plugins need it even now). But you change it
quite often, so it would make a rather volatile service :(

> diff --git a/sql/sql_string.cc b/sql/sql_string.cc
> index 9fb462e..a0b6395 100644
> --- a/sql/sql_string.cc
> +++ b/sql/sql_string.cc
> @@ -922,8 +922,8 @@ String *copy_if_not_alloced(String *to,String *from,uint32 from_length)
>        my_charset_same(from_cs, to_cs))
>    {
>      m_cannot_convert_error_pos= NULL;
> -    return to_cs->cset->copy_abort(to_cs, to, to_length, from, from_length,
> -                                   nchars, this);
> +    return to_cs->cset->copy_fix(to_cs, to, to_length, from, from_length,
> +                                 nchars, this);

Now you don't use copy_abort() anywhere.
and you've added copy_abort with the comment "A preparatory patch for MDEV-6566"

I believe you need to remove it now, may be revert the whole
commit b1b6101a

>    }
>    else
>    {
> diff --git a/include/m_ctype.h b/include/m_ctype.h
> index f08efb4..4fa8779 100644
> --- a/include/m_ctype.h
> +++ b/include/m_ctype.h
> @@ -444,7 +444,55 @@ struct my_charset_handler_st
>    size_t        (*scan)(CHARSET_INFO *, const char *b, const char *e,
>                          int sq);
>  
> -  /* Copying routines */
> +  /* String copying routines and helpers for them */
> +  /*
> +    charlen() - calculate length of the left-most character in bytes.
> +    @param  cs    Character set
> +    @param  str   The beginning of the string
> +    @param  end   The end of the string
> +    
> +    @return       MY_CS_ILSEQ if a bad byte sequence was found.
> +    @return       MY_CS_TOOSMALLN(x) if the string ended unexpectedly.
> +    @return       a positive number in the range 1..mbmaxlen,
> +                  if a valid character was found.
> +  */
> +  int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);

Why couldn't you use mbcharlen for that?
I mean, why not to add validity checks to mbcharlen?

or mb_wc

> +  /*
> +    well_formed_char_length() - returns character length of a string.
> +    
> +    @param cs          Character set
> +    @param str         The beginning of the string
> +    @param end         The end of the string
> +    @param nchars      Not more than "nchars" left-most characters are checked.
> +    @param status[OUT] Additional statistics is returned here.
> +                       "status" can be uninitialized before the call,
> +                       and it is fully initialized after the call.
> +    
> +    status->m_source_end_pos is set to the position where reading stopped.
> +    
> +    If a bad byte sequence is found, the function returns immediately and
> +    status->m_well_formed_error_pos is set to the position where a bad byte
> +    sequence was found.
> +    
> +    status->m_well_formed_error_pos is set to NULL if no bad bytes were found.
> +    If status->m_well_formed_error_pos is NULL after the call, that means:
> +    - either the function reached the end of the string,
> +    - or all "nchars" characters were read.
> +    The caller can check status->m_source_end_pos to detect which of these two
> +    happened.
> +  */
> +  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
> +                                    const char *str, const char *end,
> +                                    size_t nchars,
> +                                    MY_STRCOPY_STATUS *status);

What's the difference with numchars?

> +
> +  /*
> +    copy_fix() - copy a string like copy_abort(), but fix bad bytes to '?'.
> +  */
> +  size_t  (*copy_fix)(CHARSET_INFO *,
> +                      char *dst, size_t dst_length,
> +                      const char *src, size_t src_length,
> +                      size_t nchars, MY_STRCOPY_STATUS *status);
>    /*
>      copy_abort() - copy a string, abort if a bad byte sequence was found.
>      Not more than "nchars" characters are copied.
> diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c
> index b010c52..35ae191 100644
> --- a/strings/ctype-simple.c
> +++ b/strings/ctype-simple.c
> @@ -1108,6 +1115,19 @@ size_t my_well_formed_len_8bit(CHARSET_INFO *cs __attribute__((unused)),
>  }
>  
>  
> +size_t
> +my_well_formed_char_length_8bit(CHARSET_INFO *cs __attribute__((unused)),

Try to put the type and the function name on one line.
As I've noticed yesterday, git diff doesn't recognizes it as a function
name otherwise

> +                                const char *start, const char *end,
> +                                size_t nchars, MY_STRCOPY_STATUS *status)
> +{
> +  size_t nbytes= (size_t) (end - start);
> +  size_t res= MY_MIN(nbytes, nchars);
> +  status->m_well_formed_error_pos= NULL;
> +  status->m_source_end_pos= start + res;
> +  return res;
> +}

Regards,
Sergei


Follow ups

References