maria-developers team mailing list archive

Thread
Date

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Thu, 12 Mar 2015 21:58:41 +0400
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <20150312164756.GA9307@meddwl.fritz.box>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hi Sergei,

Thanks for review. Please see comments inline.


On 03/12/2015 08:47 PM, Sergei Golubchik wrote:

Hi, Alexander!

See below.

My main concern is the stability of the charset API.
I'd like it to be moved into a service, as plugins surely need it
(not "will need", various plugins need it even now). But you change it
quite often, so it would make a rather volatile service :(


I believe my recent changes don't not break anything in the existing
plugins.

Nothing has changed in between "init" and "scan",
which are the first and the last virtual function in
the 10.0 version of MY_CHARSET_HANDLER.


In 10.1 I recently added these functions into the very end of
MY_CHARSET_HANDLER:

charlen
well_formed_char_length
copy_fix
copy_abort


There can be problems only if somebody has already started to use
copy_abort(), which I added in 10.1 a few days ago.
But this should not happen that soon :)


I'm fine about services, but in a separate change please.

diff --git a/sql/sql_string.cc b/sql/sql_string.cc
index 9fb462e..a0b6395 100644
--- a/sql/sql_string.cc
+++ b/sql/sql_string.cc
@@ -922,8 +922,8 @@ String *copy_if_not_alloced(String *to,String *from,uint32 from_length)
        my_charset_same(from_cs, to_cs))
    {
      m_cannot_convert_error_pos= NULL;
-    return to_cs->cset->copy_abort(to_cs, to, to_length, from, from_length,
-                                   nchars, this);
+    return to_cs->cset->copy_fix(to_cs, to, to_length, from, from_length,
+                                 nchars, this);


Now you don't use copy_abort() anywhere.
and you've added copy_abort with the comment "A preparatory patch for MDEV-6566"

I believe you need to remove it now, may be revert the whole
commit b1b6101a



I was thinking that I'd need it for:

MDEV-6572 "USE dbname" with a bad sequence erroneously connects to awrong database


But this particular bug can be fixed with copy_fix() as well.
Moreover, the "fixed" version of a bad database name will be
more useful for error reporting than the "aborted" version anyway.


Okey, I can remove copy_abort() for now. It will be easy to restore it
if we really need it some day.


There is no need to remove the entire b1b6101a,
because it added MY_STRCOPY_STATUS, which I used
in a later patch 72d7b12b9c9c5ceffef9fff3adc86c149f57f20f,
and use in this patch.
So this is only to remove the my_copy_abort_*** functions.

    }
    else
    {
diff --git a/include/m_ctype.h b/include/m_ctype.h
index f08efb4..4fa8779 100644
--- a/include/m_ctype.h
+++ b/include/m_ctype.h
@@ -444,7 +444,55 @@ struct my_charset_handler_st
    size_t        (*scan)(CHARSET_INFO *, const char *b, const char *e,
                          int sq);

-  /* Copying routines */
+  /* String copying routines and helpers for them */
+  /*
+    charlen() - calculate length of the left-most character in bytes.
+    @param  cs    Character set
+    @param  str   The beginning of the string
+    @param  end   The end of the string
+
+    @return       MY_CS_ILSEQ if a bad byte sequence was found.
+    @return       MY_CS_TOOSMALLN(x) if the string ended unexpectedly.
+    @return       a positive number in the range 1..mbmaxlen,
+                  if a valid character was found.
+  */
+  int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);


Why couldn't you use mbcharlen for that?
I mean, why not to add validity checks to mbcharlen?


Exactly to have a stable MY_CHARSET_INFO ABI in 10.1.

I'm fine to remove mbcharlen in the very beginning of 10.2.

I guess it's a good idea to add #ifdef not to forget to do this.


or mb_wc


mb_wc() is not good here, as it involves Unicode conversion table
lookups, which needs much more CPU resources than just detection of
a character length.

Also, the caller of mb_wc() has to handle special
cases when a character is valid but is not assigned,
which needs even more CPU.

+  /*
+    well_formed_char_length() - returns character length of a string.
+
+    @param cs          Character set
+    @param str         The beginning of the string
+    @param end         The end of the string
+    @param nchars      Not more than "nchars" left-most characters are checked.
+    @param status[OUT] Additional statistics is returned here.
+                       "status" can be uninitialized before the call,
+                       and it is fully initialized after the call.
+
+    status->m_source_end_pos is set to the position where reading stopped.
+
+    If a bad byte sequence is found, the function returns immediately and
+    status->m_well_formed_error_pos is set to the position where a bad byte
+    sequence was found.
+
+    status->m_well_formed_error_pos is set to NULL if no bad bytes were found.
+    If status->m_well_formed_error_pos is NULL after the call, that means:
+    - either the function reached the end of the string,
+    - or all "nchars" characters were read.
+    The caller can check status->m_source_end_pos to detect which of these two
+    happened.
+  */
+  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
+                                    const char *str, const char *end,
+                                    size_t nchars,
+                                    MY_STRCOPY_STATUS *status);


What's the difference with numchars?


numchars() can only return number of characters.

well_formed_char_length() returns 3 things:
- the number of characters (as a return value)
- length in bytes (in status)
- if there were errors (in status)

All three things are needed for copy_fix().

The function is needed to avoid doing 3 separate loops in the same string.

+
+  /*
+    copy_fix() - copy a string like copy_abort(), but fix bad bytes to '?'.
+  */
+  size_t  (*copy_fix)(CHARSET_INFO *,
+                      char *dst, size_t dst_length,
+                      const char *src, size_t src_length,
+                      size_t nchars, MY_STRCOPY_STATUS *status);
    /*
      copy_abort() - copy a string, abort if a bad byte sequence was found.
      Not more than "nchars" characters are copied.
diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c
index b010c52..35ae191 100644
--- a/strings/ctype-simple.c
+++ b/strings/ctype-simple.c
@@ -1108,6 +1115,19 @@ size_t my_well_formed_len_8bit(CHARSET_INFO *cs __attribute__((unused)),
  }


+size_t
+my_well_formed_char_length_8bit(CHARSET_INFO *cs __attribute__((unused)),


Try to put the type and the function name on one line.
As I've noticed yesterday, git diff doesn't recognizes it as a function
name otherwise


Oops. This is too bad. I can change this particular patch.

But we have this all around the code, and it's very easy
to find function definitions this way using "grep".
It will be pity to remove this style.

Shouldn't we report this to the git team?

Thanks!

+                                const char *start, const char *end,
+                                size_t nchars, MY_STRCOPY_STATUS *status)
+{
+  size_t nbytes= (size_t) (end - start);
+  size_t res= MY_MIN(nbytes, nchars);
+  status->m_well_formed_error_pos= NULL;
+  status->m_source_end_pos= start + res;
+  return res;
+}


Regards,
Sergei

Follow ups

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Sergei Golubchik, 2015-03-12

References

MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-06
Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Sergei Golubchik, 2015-03-12