maria-developers team mailing list archive

Thread
Date

Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Tue, 23 Jun 2015 16:09:39 +0400
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <20150618200416.GA24777@meddwl.fritz.box>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0

Hi Sergei,

Thanks for your review. Please see comments inline.

On 06/19/2015 12:04 AM, Sergei Golubchik wrote:

Hi, Alexander!

On May 25, Alexander Barkov wrote:

    Hi Sergei,

For simplicity I decided to split:

MDEV-8036 Fix all collations to compare broken bytes as "greater than
any non-broken character"

into sub-tasks.

Please review a patch for the first sub-task MDEV-8214.


I don't know why you wanted me to review it :)


I felt so insecure and felt that I needed help like I'd never done before :)

Seriously, there are some thing that did not look nice,
for example this undef/ifdef stuff.

Looks pretty ok. Of course there were few comments and questions, but
nothing serious.

diff --git a/strings/ctype-big5.c b/strings/ctype-big5.c
index eda81c0..aa06a7a 100644
--- a/strings/ctype-big5.c
+++ b/strings/ctype-big5.c
@@ -6853,11 +6771,29 @@ my_mb_wc_big5(CHARSET_INFO *cs __attribute__((unused)),
  }


-static MY_COLLATION_HANDLER my_collation_big5_chinese_ci_handler =
+#undef MY_FUNCTION_NAME
+#undef WEIGHT_MB1
+#undef WEIGHT_MB2


why do you start from undef-ing macros that couldn't be defined here?


ctype-strcoll.ic is included two times from every file,
for case insensitive and for binary collations respectively.

My intention was to make every inclusion of ctype-strcoll.ic look
exactly the same, no matter if it's the first or the second inclusion.
So whenever I need to include it from a new place,
I don't have to think and just copy and paste from any place.

+#define MY_FUNCTION_NAME(x)   my_ ## x ## _big5_chinese_ci
+#define WEIGHT_MB1(x)        (sort_order_big5[(uchar) (x)])
+#define WEIGHT_MB2(x,y)      (big5code(x, y))
+#include "ctype-strcoll.ic"
+
+
+#undef MY_FUNCTION_NAME
+#undef WEIGHT_MB1
+#undef WEIGHT_MB2


alternatively you could undef them at the end of ctype-strcoll.ic
it'll also be a protection against including this file twice
(although it cannot be done now anyway)


Thanks for the idea. This can certainly be done for
WEIGHT_MB1 and WEIGHT_MB2.

What do you suggest to do with MY_FUNCTION_NAME?
It's now used in ctype-strcoll.ic in collation name context,
and in ctype-mb.ic in character name context.

So the structure of a file is about like this:


#define MY_FUNCTION_NAME(x)   my_ ## x ## _sjis
#define ... character set specific definitions
#include "ctype-mb.ic"
...
#undef MY_FUNCTION_NAME
#define MY_FUNCTION_NAME(x)   my_ ## x ## _sjis_japanese_ci
#define ... ci collation specific definitions
#include "ctype-strcoll.ic"
...
#undef MY_FUNCTION_NAME
#define MY_FUNCTION_NAME(x)   my_ ## x ## _sjis_bin
#define ... bin collation specific definitions
#include "ctype-strcoll.ic"
...

Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic andctype-strncoll.ic



Another approach would be to have two things:

MY_CHARSET_FUNCTION_NAME
MY_COLLATION_FUNCTION_NAME

What will be easier to read?

+#define MY_FUNCTION_NAME(x)   my_ ## x ## _big5_bin
+#define WEIGHT_MB1(x)        ((uchar) (x))
+#define WEIGHT_MB2(x,y)      (big5code(x, y))
+#include "ctype-strcoll.ic"
+
+
+static MY_COLLATION_HANDLER my_collation_handler_big5_chinese_ci=
  {
    NULL,			/* init */
-  my_strnncoll_big5,
-  my_strnncollsp_big5,
+  my_strnncoll_big5_chinese_ci,
+  my_strnncollsp_big5_chinese_ci,
    my_strnxfrm_big5,
    my_strnxfrmlen_simple,
    my_like_range_mb,
diff --git a/strings/ctype-euc_kr.c b/strings/ctype-euc_kr.c
index a2c95bf..db47b3d 100644
--- a/strings/ctype-euc_kr.c
+++ b/strings/ctype-euc_kr.c
@@ -9938,21 +9940,56 @@ my_mb_wc_euc_kr(CHARSET_INFO *cs __attribute__((unused)),
  }


-static MY_COLLATION_HANDLER my_collation_ci_handler =
+#undef MY_FUNCTION_NAME
+#undef WEIGHT_MB1
+#undef WEIGHT_MB2
+#define MY_FUNCTION_NAME(x)   my_ ## x ## _euckr_korean_ci
+#define WEIGHT_MB1(x)        (sort_order_euc_kr[(uchar) (x)])
+#define WEIGHT_MB2(x,y)      (euckrcode(x, y))
+#include "ctype-strcoll.ic"
+
+
+#undef MY_FUNCTION_NAME
+#undef WEIGHT_MB1
+#undef WEIGHT_MB2
+#define MY_FUNCTION_NAME(x)   my_ ## x ## _euckr_bin
+#define WEIGHT_MB1(x)        ((uchar) (x))
+#define WEIGHT_MB2(x,y)      (euckrcode(x, y))
+#include "ctype-strcoll.ic"
+
+
+static MY_COLLATION_HANDLER my_collation_handler_euckr_korean_ci=
  {
-  NULL,			/* init */
-  my_strnncoll_simple,  /* strnncoll  */
-  my_strnncollsp_simple,
-  my_strnxfrm_mb,	/* strnxfrm   */
+  NULL,                 /* init */
+  my_strnncoll_euckr_korean_ci,
+  my_strnncollsp_euckr_korean_ci,


hmm, so is euckr multi-byte or not?
wikipedia says it is, my_strnxfrm_mb.
but why my_strnncoll_simple?


It is multi-byte, but a very simple one:

- a byte in the range 0x00..0x7F cannot be a part of a multi-byte
character,

- a weight of a multi-byte character is just equal to its multi-byterepresentation

- all multi-byte characters have its own distinct weight

So before this change strnncoll[sp] just needed to do some mapping on
the byte range 0x00..0x7F and do not do any mapping on the range
0x80..0xFF, and this is what my_strnncoll[sp]_simple can perfectly do.


Now my_strnncoll[sp]_simple do not work any more, because we need to
handle bad byte sequences differently.

+  my_strnxfrm_mb,
    my_strnxfrmlen_simple,
-  my_like_range_mb,     /* like_range */
-  my_wildcmp_mb,	/* wildcmp    */
+  my_like_range_mb,
+  my_wildcmp_mb,
    my_strcasecmp_mb,
    my_instr_mb,
    my_hash_sort_simple,
    my_propagate_simple
  };

+
+static MY_COLLATION_HANDLER my_collation_handler_euckr_bin=
+{
+  NULL,                 /* init */
+  my_strnncoll_euckr_bin,
+  my_strnncollsp_euckr_bin,
+  my_strnxfrm_mb,
+  my_strnxfrmlen_simple,
+  my_like_range_mb,
+  my_wildcmp_mb_bin,
+  my_strcasecmp_mb_bin,
+  my_instr_mb,
+  my_hash_sort_mb_bin,
+  my_propagate_simple
+};
+
+
  static MY_CHARSET_HANDLER my_charset_handler=
  {
    NULL,			/* init */
diff --git a/strings/ctype-strcoll.ic b/strings/ctype-strcoll.ic
new file mode 100644
index 0000000..7217f99
--- /dev/null
+++ b/strings/ctype-strcoll.ic
@@ -0,0 +1,208 @@
+/*
+   Copyright (c) 2015, MariaDB Foundation
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; version 2 of the License.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, write to the Free Software
+   Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+*/
+
+
+#ifndef MY_FUNCTION_NAME
+#error MY_FUNCTION_NAME is not defined
+#endif
+
+
+/*
+  The weight for automatically padded spaces when comparing strings with
+  the PAD SPACE property.
+  Should normally be equal to the weight of a regular space.


it is not a space for sjis and cp932? what does that mean?
padding space is not the same as a regular space? why?


It is the same.

- For other character sets it looks like:

#define WEIGHT_MB1(x)        (sort_order_euc_kr[(uchar) (x)])
#define WEIGHT_PAD_SPACE     ' '

So regular and pad spaces have the same weight: 0x20.



- For sjis and cp932 wight for a single byte character is just
defined in a more complex way:

#define WEIGHT_MB1(x)        (256 * (int) (uchar) (x))
#define WEIGHT_PAD_SPACE     (256 * (int) ' ')

So regular and pad spaces have the same weight again: 0x2000


This is to preserve this orderding:


/*
  cp932_chinese_ci and cp932_bin sort character blocks in this order:
  1. [00..7F]                - 7BIT characters (ASCII)
  2. [81..9F][40..7E,80..FC] - MB2 characters, part1
  3. [A1..DF]                - 8BIT characters (Kana)
  4. [E0..FC][40..7E,80..FC] - MB2 characters, part2
*/


I just decided to use 16 bit weights to handle all single byte
characters (both 00..7F and A1..DF) in the same way,
to make the code simpler.

+*/
+#ifndef WEIGHT_PAD_SPACE
+#define WEIGHT_PAD_SPACE  (' ')
+#endif
+
+
+/*
+  Weight of an illegal byte.
+  Must be greater than weight of any normal character.
+  Two bad bytes are compared binary.
+*/
+#ifndef WEIGHT_ILSEQ


why #ifndef? it creates an impression that the caller (includer :)
can redefine it. better define WEIGHT_ILSEQ here unconditionally
and undefine it at the end


You got a correct impression.

The next step is to implement Asian MB3 character sets, eucjpms and

ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they willredefine it, most likely like this:


#define WEIGHT_ILSEQ(x)   (0xFFFF00 + (x))

+#define WEIGHT_ILSEQ(x)   (0xFF00 + (x))
+#endif
+
+
+/**
+  Scan a valid character, or a bad byte, or an auto-padded space
+  from a string and calculate the weight of the scanned sequence.
+
+  @param [OUT] weight - the weight is returned here
+  @param str          - the string
+  @param end          - the end of the string
+  @return             - the number of bytes scanned
+
+  The including source file must define the following macros:
+  IS_MB1_CHAR(x)
+  IS_MB2_CHAR(x,y)
+  WEIGHT_PAD_SPACE
+  WEIGHT_MB1(x)
+  WEIGHT_MB2(x,y)
+  WEIGHT_ILSEQ(x)
+*/
+static inline uint
+MY_FUNCTION_NAME(scan_weight)(int *weight, const uchar *str, const uchar *end)
+{
+  if (str >= end)
+  {
+    *weight= WEIGHT_PAD_SPACE;
+    return 0;
+  }
+
+  if (IS_MB1_CHAR(*str))
+  {
+    *weight= WEIGHT_MB1(*str);           /* A valid single byte character*/
+    return 1;
+  }
+
+  if (str + 2 > end)                     /* The string ended unexpectedly */
+    goto bad;                            /* Treat as a bad byte */
+
+  if (IS_MB2_CHAR(str[0], str[1]))
+  {
+    *weight= WEIGHT_MB2(str[0], str[1]);
+    return 2;                            /* A valid two-byte character */
+  }
+
+bad:
+  *weight= WEIGHT_ILSEQ(str[0]);         /* Bad byte */
+  return 1;
+}
+
+
+/**
+  Compare two strings according to the collation,
+  without handling the PAD SPACE property.
+
+  Note, cs->coll->strnncoll() is usually used to compare identifiers.
+  Perhaps we should eventually (in 10.2?) create a new collation
+  my_charset_utf8_general_ci_no_pad and have only one comparison function
+  in MY_COLLATION_HANDLER.
+
+  @param cs          - the character set and collation
+  @param a           - the left string
+  @param a_length    - the length of the left string
+  @param b           - the right string
+  @param b_length    - the length of the right string
+  @param b_is_prefix - if the caller wants to check if "b" is a prefix of "a"
+  @return            - the comparison result
+*/
+static int
+MY_FUNCTION_NAME(strnncoll)(CHARSET_INFO *cs __attribute__((unused)),
+                            const uchar *a, size_t a_length,
+                            const uchar *b, size_t b_length,
+                            my_bool b_is_prefix)
+{
+  const uchar *a_end= a + a_length;
+  const uchar *b_end= b + b_length;
+  for ( ; ; )
+  {
+    int a_weight, b_weight, res;
+    uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
+    uint b_wlen= MY_FUNCTION_NAME(scan_weight)(&b_weight, b, b_end);
+    /*
+      a_wlen  b_wlen Comment
+      ------  ------ -------
+      0       0      Strings ended simultaneously, "a" and "b" are equal.
+      0       >0     "a" is a prefix of "b", so "a" is smaller.
+      >0      0      "b" is a prefix of "a", check b_is_prefix.
+      >0      >0     Two weights were scanned, check weight difference.
+    */
+    if (!a_wlen)
+      return b_wlen ? -b_weight : 0;


can weight be negative? if not, the code will be clearer if you declare it unsigned


Weights cannot be negative. But as I need things like:
(-b_weight) and (a_weight-b_weight)
I thought signed will be properer here.

+
+    if (!b_wlen)
+      return b_is_prefix ? 0 : a_weight;
+
+    if ((res= (a_weight - b_weight)))
+      return res;
+    /*
+      None of the strings has ended yet.
+    */
+    DBUG_ASSERT(a < a_end);
+    DBUG_ASSERT(b < b_end);
+    a+= a_wlen;
+    b+= b_wlen;
+  }
+  DBUG_ASSERT(0);
+  return 0;
+}
+
+
+/**
+  Compare two strings according to the collation, with PAD SPACE handling.
+
+  @param cs          - the character set and collation
+  @param a           - the left string
+  @param a_length    - the length of the left string
+  @param b           - the right string
+  @param b_length    - the length of the right string
+  @param diff_if_only_endspace_difference - not used in the code.
+                       TODO: this should be eventually removed (in 10.2?)


why not now (in a separate changeset, but in 10.1 still)?



I created:

MDEV-8360 Clean-up CHARSET_INFO: strnncollsp:diff_if_only_endspace_difference

+  @return            - the comparison result
+*/
+
+static int
+MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
+                              const uchar *a, size_t a_length,
+                              const uchar *b, size_t b_length,
+                              my_bool diff_if_only_endspace_difference
+                              __attribute__((unused)))
+{
+  const uchar *a_end= a + a_length;
+  const uchar *b_end= b + b_length;
+  for ( ; ; )
+  {
+    int a_weight, b_weight, res;
+    uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
+    uint b_wlen= MY_FUNCTION_NAME(scan_weight)(&b_weight, b, b_end);
+    if ((res= (a_weight - b_weight)))
+    {
+      /*
+        Got two different weights. Each weight can be generated by either of:
+        - a real character
+        - a bad byte sequence or an incomplete byte sequence
+        - an auto-generated trailing space (PAD SPACE)
+        It does not matter how exactly each weight was generated.
+        Just return the weight difference.
+      */
+      return res;
+    }
+    if (!a_wlen && !b_wlen)
+    {
+      /*
+        Got two auto-generated trailing spaces, i.e.
+        both strings have now ended, so they are equal.
+      */
+      DBUG_ASSERT(a == a_end);
+      DBUG_ASSERT(b == b_end);
+      return 0;
+    }
+    /*
+      At least one of the strings has not ended yet, continue comparison.
+    */
+    DBUG_ASSERT(a < a_end || b < b_end);
+    a+= a_wlen;
+    b+= b_wlen;
+  }
+  DBUG_ASSERT(0);
+  return 0;
+}


Regards,
Sergei

Follow ups

Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Sergei Golubchik, 2015-06-24

References

MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Alexander Barkov, 2015-05-24
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Sergei Golubchik, 2015-06-18