maria-developers team mailing list archive

Thread
Date

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Fri, 13 Mar 2015 08:50:59 +0400
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <20150312195722.GB19952@meddwl.fritz.box>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hi Sergei,


On 03/12/2015 11:57 PM, Sergei Golubchik wrote:

Hi, Alexander!

On 03/12/2015 08:47 PM, Sergei Golubchik wrote:

Hi, Alexander!

See below.

My main concern is the stability of the charset API.
I'd like it to be moved into a service, as plugins surely need it
(not "will need", various plugins need it even now). But you change it
quite often, so it would make a rather volatile service :(


I believe my recent changes don't not break anything in the existing
plugins.


I don't care much about that, because as long as charsets are not a
service we do not promise ABI stability.

I'm fine about services, but in a separate change please.


Of course, I didn't mean you should do it now.
It should be a separate MDEV too.


I have created https://mariadb.atlassian.net/browse/MDEV-7768
"Wrap character set and collations functions into services"

diff --git a/sql/sql_string.cc b/sql/sql_string.cc
index 9fb462e..a0b6395 100644
--- a/sql/sql_string.cc
+++ b/sql/sql_string.cc
@@ -922,8 +922,8 @@ String *copy_if_not_alloced(String *to,String *from,uint32 from_length)
         my_charset_same(from_cs, to_cs))
     {
       m_cannot_convert_error_pos= NULL;
-    return to_cs->cset->copy_abort(to_cs, to, to_length, from, from_length,
-                                   nchars, this);
+    return to_cs->cset->copy_fix(to_cs, to, to_length, from, from_length,
+                                 nchars, this);


Now you don't use copy_abort() anywhere.
and you've added copy_abort with the comment "A preparatory patch for MDEV-6566"

I believe you need to remove it now, may be revert the whole
commit b1b6101a


I was thinking that I'd need it for:

MDEV-6572 "USE dbname" with a bad sequence erroneously connects to a
wrong database

But this particular bug can be fixed with copy_fix() as well.
Moreover, the "fixed" version of a bad database name will be
more useful for error reporting than the "aborted" version anyway.

Okey, I can remove copy_abort() for now. It will be easy to restore it
if we really need it some day.

There is no need to remove the entire b1b6101a,
because it added MY_STRCOPY_STATUS, which I used
in a later patch 72d7b12b9c9c5ceffef9fff3adc86c149f57f20f,
and use in this patch.
So this is only to remove the my_copy_abort_*** functions.


Agree

diff --git a/include/m_ctype.h b/include/m_ctype.h
index f08efb4..4fa8779 100644
--- a/include/m_ctype.h
+++ b/include/m_ctype.h
@@ -444,7 +444,55 @@ struct my_charset_handler_st
     size_t        (*scan)(CHARSET_INFO *, const char *b, const char *e,
                           int sq);

-  /* Copying routines */
+  /* String copying routines and helpers for them */
+  /*
+    charlen() - calculate length of the left-most character in bytes.
+    @param  cs    Character set
+    @param  str   The beginning of the string
+    @param  end   The end of the string
+
+    @return       MY_CS_ILSEQ if a bad byte sequence was found.
+    @return       MY_CS_TOOSMALLN(x) if the string ended unexpectedly.
+    @return       a positive number in the range 1..mbmaxlen,
+                  if a valid character was found.
+  */
+  int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);


Why couldn't you use mbcharlen for that?
I mean, why not to add validity checks to mbcharlen?


Exactly to have a stable MY_CHARSET_INFO ABI in 10.1.


But it wouldn't, would it?


Providing that I don't touch the 10.0 part of MY_CHARSET_HANDLER
and only add functions into the end of MY_CHARSET_HANDLER,
both API and ABI remain stable after my 10.1 patches.
Adding new functions into the end of MY_CHARSET_HANDLER is safe.


As for removing some of the old functions from MY_CHARSET_HANDLER,
I had thoughts to do this in 10.2. I just created as a task:

https://mariadb.atlassian.net/browse/MDEV-7769
"MY_CHARSET_INFO refactoring"

It's very inline with your suggestions.


The only difference between mbcharlen and your new charlen is - as far
as I can see - that mbcharlen has fewer validity checks and may
sometimes return incorrect result (e.g. it can return 4 when end-str==2).

One can argue that it's a bug in mbcharlen, and fix it always to return
a correct value.


There are no bug in it. It's fine for some limited purposes.
But it's obviously too limited.
There will be a much more useful byte_property() instead.

This won't change the ABI.
And you won't need charlen(), because it's nothing more than "mbcharlen
with the bug fixed".


There are two separate things:

a. mbcharlen(), which operates on single bytes.
So will do the proposed byte_property().

b. ismbchar(), which operates on strings.
So does its replacement charlen().


Noteice, (a) and (b) are not replacements to each other!
They complement each other. I think we'll still need both
byte_property() and charlen().

See MDEV-7769 for details on byte_property().

+  /*
+    well_formed_char_length() - returns character length of a string.
+
+    @param cs          Character set
+    @param str         The beginning of the string
+    @param end         The end of the string
+    @param nchars      Not more than "nchars" left-most characters are checked.
+    @param status[OUT] Additional statistics is returned here.
+                       "status" can be uninitialized before the call,
+                       and it is fully initialized after the call.
+
+    status->m_source_end_pos is set to the position where reading stopped.
+
+    If a bad byte sequence is found, the function returns immediately and
+    status->m_well_formed_error_pos is set to the position where a bad byte
+    sequence was found.
+
+    status->m_well_formed_error_pos is set to NULL if no bad bytes were found.
+    If status->m_well_formed_error_pos is NULL after the call, that means:
+    - either the function reached the end of the string,
+    - or all "nchars" characters were read.
+    The caller can check status->m_source_end_pos to detect which of these two
+    happened.
+  */
+  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
+                                    const char *str, const char *end,
+                                    size_t nchars,
+                                    MY_STRCOPY_STATUS *status);


What's the difference with numchars?


numchars() can only return number of characters.

well_formed_char_length() returns 3 things:
- the number of characters (as a return value)
- length in bytes (in status)
- if there were errors (in status)

All three things are needed for copy_fix().

The function is needed to avoid doing 3 separate loops in the same string.


Okay, then you don't need numchars, do you? There could be a
charset-independent wrapper that calls well_formed_char_length()
and return the number of characters. Same for well_formed_len(), I
suppose. That is both numchars and well_formed_len can be removed from
the MY_CHARSET_HANDLER structure.


Exactly.


But this will certainly change the ABI, so we can postpone it until 10.2
(with an #ifdef, as you suggest - that was a good idea).


I think I'll add something like this:

#if MYSQL_VERSION_ID >= 100200
#warning MDEV-7769 "MY_CHARSET_INFO refactoring" must be done before 10.2.1
#endif


It will be too strict to fail compilation. But warning is fine.

diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c
index b010c52..35ae191 100644
--- a/strings/ctype-simple.c
+++ b/strings/ctype-simple.c
@@ -1108,6 +1115,19 @@ size_t my_well_formed_len_8bit(CHARSET_INFO *cs __attribute__((unused)),
   }

+size_t
+my_well_formed_char_length_8bit(CHARSET_INFO *cs __attribute__((unused)),


Try to put the type and the function name on one line.
As I've noticed yesterday, git diff doesn't recognizes it as a function
name otherwise


Oops. This is too bad. I can change this particular patch.

But we have this all around the code, and it's very easy
to find function definitions this way using "grep".
It will be pity to remove this style.

Shouldn't we report this to the git team?


Ok, I'm confused now. I've seen these incorrect diff tags in this your
patch and in sanja commits too.

But now I've tried and couldn't repeat it myself. So, forget it, the
issue needs more investigation.


Okey.


So, did we agree that I remove copy_abort() from MY_CHARSET_HANDLER,
and push?

Is there anything I missed?

Thanks.


Regards,
Sergei

Follow ups

Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Sergei Golubchik, 2015-03-13

References

MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-06
Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Sergei Golubchik, 2015-03-12
Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Alexander Barkov, 2015-03-12
Re: MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion
From: Sergei Golubchik, 2015-03-12