maria-developers team mailing list archive

Thread
Date

Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Fri, 26 Jun 2015 12:09:18 +0400
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <20150624164630.GB17670@meddwl.fritz.box>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0

Hi Sergei,


On 06/24/2015 08:46 PM, Sergei Golubchik wrote:

Hi, Alexander!

On Jun 23, Alexander Barkov wrote:

+#define MY_FUNCTION_NAME(x)   my_ ## x ## _big5_chinese_ci
+#define WEIGHT_MB1(x)        (sort_order_big5[(uchar) (x)])
+#define WEIGHT_MB2(x,y)      (big5code(x, y))
+#include "ctype-strcoll.ic"
+
+
+#undef MY_FUNCTION_NAME
+#undef WEIGHT_MB1
+#undef WEIGHT_MB2


alternatively you could undef them at the end of ctype-strcoll.ic
it'll also be a protection against including this file twice
(although it cannot be done now anyway)


Thanks for the idea. This can certainly be done for
WEIGHT_MB1 and WEIGHT_MB2.


Done. Now strcoll.ic ends with:

#undef MY_FUNCTION_NAME
#undef WEIGHT_ILSEQ
#undef WEIGHT_MB1
#undef WEIGHT_MB2
#undef WEIGHT_PAD_SPACE

So the files that include strcoll.ic do not have to undefine.
Looks better.

<skip>


Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and
ctype-strncoll.ic


Yes, I would've done that, I guess.


Done.

<skip>


By the way, could you please rename sql/sql_plugin_services.h to
sql/sql_plugin_services.ic and sql/sys_vars.h to sql/sys_vars.ic?

As you're introducing a new convention, let's follow it consistently.

Agree. Note, I stole the idea from InnoDB, they use this *.ic extensionwidely.


I created a separate task to rename the other files:

MDEV-8384 Rename included source files to *.ic

<skip>

+/*
+  The weight for automatically padded spaces when comparing strings with
+  the PAD SPACE property.
+  Should normally be equal to the weight of a regular space.


it is not a space for sjis and cp932? what does that mean?
padding space is not the same as a regular space? why?


It is the same.

- For sjis and cp932 wight for a single byte character is just
defined in a more complex way:

#define WEIGHT_MB1(x)        (256 * (int) (uchar) (x))
#define WEIGHT_PAD_SPACE     (256 * (int) ' ')

So regular and pad spaces have the same weight again: 0x2000


In that case it might've been easier to read (less questions asked) if
you'd define it unconditionally as

    #define WEIGHT_PAD_SPACE WEIGHT_MB1(' ')


In case of a _ci collation WEIGHT_MB1 uses a conversion table,
to provide case insensitivity:

#define WEIGHT_MB1(x)        (256 * (int) sort_order_sjis[(uchar) (x)])


So WEIGHT_PAD_SPACE defined through WEIGHT_MB1 would be:

    (256 * (int) sort_order_sjis[(uchar) ' '])

instead of just:

   (256 * (int) (uchar) (x))


So I decided to keep it in the way I've done,
and added more comments near the default definition in strcoll.ic

+/*
+  Weight of an illegal byte.
+  Must be greater than weight of any normal character.
+  Two bad bytes are compared binary.
+*/
+#ifndef WEIGHT_ILSEQ
+#define WEIGHT_ILSEQ(x)   (0xFF00 + (x))


why #ifndef? it creates an impression that the caller (includer :)
can redefine it. better define WEIGHT_ILSEQ here unconditionally
and undefine it at the end


You got a correct impression.

The next step is to implement Asian MB3 character sets, eucjpms and
ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will
redefine it, most likely like this:

#define WEIGHT_ILSEQ(x)   (0xFFFF00 + (x))


I don't particularly like that. Same as with WEIGHT_PAD_SPACE - one
thinks that WEIGHT_ILSEQ depends on the charset. While in fact it only
depends on the max_char_length.

I cannot immediately suggest a better definition though (constant and
easy to understand), so it's ok to keep it like you've done.


Okey.

Running mtr last time now, then will push it.

Thanks for your suggestions!


Regards,
Sergei

References

MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Alexander Barkov, 2015-05-24
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Sergei Golubchik, 2015-06-18
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Alexander Barkov, 2015-06-23
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Sergei Golubchik, 2015-06-24