maria-developers team mailing list archive

Thread
Date

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion

To: Alexander Barkov <bar@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Thu, 12 Mar 2015 20:57:22 +0100
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <5501D3D1.6060807@mariadb.org>
User-agent: Mutt/1.5.23 (2014-03-12)

Hi, Alexander!

> On 03/12/2015 08:47 PM, Sergei Golubchik wrote:
> > Hi, Alexander!
> >
> > See below.
> >
> > My main concern is the stability of the charset API.
> > I'd like it to be moved into a service, as plugins surely need it
> > (not "will need", various plugins need it even now). But you change it
> > quite often, so it would make a rather volatile service :(
> 
> I believe my recent changes don't not break anything in the existing
> plugins.

I don't care much about that, because as long as charsets are not a
service we do not promise ABI stability.

> I'm fine about services, but in a separate change please.

Of course, I didn't mean you should do it now.
It should be a separate MDEV too.

> >> diff --git a/sql/sql_string.cc b/sql/sql_string.cc
> >> index 9fb462e..a0b6395 100644
> >> --- a/sql/sql_string.cc
> >> +++ b/sql/sql_string.cc
> >> @@ -922,8 +922,8 @@ String *copy_if_not_alloced(String *to,String *from,uint32 from_length)
> >>         my_charset_same(from_cs, to_cs))
> >>     {
> >>       m_cannot_convert_error_pos= NULL;
> >> -    return to_cs->cset->copy_abort(to_cs, to, to_length, from, from_length,
> >> -                                   nchars, this);
> >> +    return to_cs->cset->copy_fix(to_cs, to, to_length, from, from_length,
> >> +                                 nchars, this);
> >
> > Now you don't use copy_abort() anywhere.
> > and you've added copy_abort with the comment "A preparatory patch for MDEV-6566"
> >
> > I believe you need to remove it now, may be revert the whole
> > commit b1b6101a
> 
> I was thinking that I'd need it for:
> 
> MDEV-6572 "USE dbname" with a bad sequence erroneously connects to a 
> wrong database
> 
> But this particular bug can be fixed with copy_fix() as well.
> Moreover, the "fixed" version of a bad database name will be
> more useful for error reporting than the "aborted" version anyway.
> 
> Okey, I can remove copy_abort() for now. It will be easy to restore it
> if we really need it some day.
> 
> There is no need to remove the entire b1b6101a,
> because it added MY_STRCOPY_STATUS, which I used
> in a later patch 72d7b12b9c9c5ceffef9fff3adc86c149f57f20f,
> and use in this patch.
> So this is only to remove the my_copy_abort_*** functions.

Agree

> >> diff --git a/include/m_ctype.h b/include/m_ctype.h
> >> index f08efb4..4fa8779 100644
> >> --- a/include/m_ctype.h
> >> +++ b/include/m_ctype.h
> >> @@ -444,7 +444,55 @@ struct my_charset_handler_st
> >>     size_t        (*scan)(CHARSET_INFO *, const char *b, const char *e,
> >>                           int sq);
> >>
> >> -  /* Copying routines */
> >> +  /* String copying routines and helpers for them */
> >> +  /*
> >> +    charlen() - calculate length of the left-most character in bytes.
> >> +    @param  cs    Character set
> >> +    @param  str   The beginning of the string
> >> +    @param  end   The end of the string
> >> +
> >> +    @return       MY_CS_ILSEQ if a bad byte sequence was found.
> >> +    @return       MY_CS_TOOSMALLN(x) if the string ended unexpectedly.
> >> +    @return       a positive number in the range 1..mbmaxlen,
> >> +                  if a valid character was found.
> >> +  */
> >> +  int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);
> >
> > Why couldn't you use mbcharlen for that?
> > I mean, why not to add validity checks to mbcharlen?
> 
> Exactly to have a stable MY_CHARSET_INFO ABI in 10.1.

But it wouldn't, would it?

The only difference between mbcharlen and your new charlen is - as far
as I can see - that mbcharlen has fewer validity checks and may
sometimes return incorrect result (e.g. it can return 4 when end-str==2).

One can argue that it's a bug in mbcharlen, and fix it always to return
a correct value. This won't change the ABI.
And you won't need charlen(), because it's nothing more than "mbcharlen
with the bug fixed".

> >> +  /*
> >> +    well_formed_char_length() - returns character length of a string.
> >> +
> >> +    @param cs          Character set
> >> +    @param str         The beginning of the string
> >> +    @param end         The end of the string
> >> +    @param nchars      Not more than "nchars" left-most characters are checked.
> >> +    @param status[OUT] Additional statistics is returned here.
> >> +                       "status" can be uninitialized before the call,
> >> +                       and it is fully initialized after the call.
> >> +
> >> +    status->m_source_end_pos is set to the position where reading stopped.
> >> +
> >> +    If a bad byte sequence is found, the function returns immediately and
> >> +    status->m_well_formed_error_pos is set to the position where a bad byte
> >> +    sequence was found.
> >> +
> >> +    status->m_well_formed_error_pos is set to NULL if no bad bytes were found.
> >> +    If status->m_well_formed_error_pos is NULL after the call, that means:
> >> +    - either the function reached the end of the string,
> >> +    - or all "nchars" characters were read.
> >> +    The caller can check status->m_source_end_pos to detect which of these two
> >> +    happened.
> >> +  */
> >> +  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
> >> +                                    const char *str, const char *end,
> >> +                                    size_t nchars,
> >> +                                    MY_STRCOPY_STATUS *status);
> >
> > What's the difference with numchars?
> 
> numchars() can only return number of characters.
> 
> well_formed_char_length() returns 3 things:
> - the number of characters (as a return value)
> - length in bytes (in status)
> - if there were errors (in status)
> 
> All three things are needed for copy_fix().
> 
> The function is needed to avoid doing 3 separate loops in the same string.

Okay, then you don't need numchars, do you? There could be a
charset-independent wrapper that calls well_formed_char_length()
and return the number of characters. Same for well_formed_len(), I
suppose. That is both numchars and well_formed_len can be removed from
the MY_CHARSET_HANDLER structure.

But this will certainly change the ABI, so we can postpone it until 10.2
(with an #ifdef, as you suggest - that was a good idea).

> >> diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c
> >> index b010c52..35ae191 100644
> >> --- a/strings/ctype-simple.c
> >> +++ b/strings/ctype-simple.c
> >> @@ -1108,6 +1115,19 @@ size_t my_well_formed_len_8bit(CHARSET_INFO *cs __attribute__((unused)),
> >>   }
> >>
> >> +size_t
> >> +my_well_formed_char_length_8bit(CHARSET_INFO *cs __attribute__((unused)),
> >
> > Try to put the type and the function name on one line.
> > As I've noticed yesterday, git diff doesn't recognizes it as a function
> > name otherwise
> 
> Oops. This is too bad. I can change this particular patch.
> 
> But we have this all around the code, and it's very easy
> to find function definitions this way using "grep".
> It will be pity to remove this style.
> 
> Shouldn't we report this to the git team?

Ok, I'm confused now. I've seen these incorrect diff tags in this your
patch and in sanja commits too.

But now I've tried and couldn't repeat it myself. So, forget it, the
issue needs more investigation.

Regards,
Sergei

Follow ups

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-13

References

MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-06
Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Sergei Golubchik, 2015-03-12
Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-12