← Back to team overview

maria-developers team mailing list archive

Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"


Hi Sergei,

On 06/24/2015 08:46 PM, Sergei Golubchik wrote:
Hi, Alexander!

On Jun 23, Alexander Barkov wrote:

+#define MY_FUNCTION_NAME(x)   my_ ## x ## _big5_chinese_ci
+#define WEIGHT_MB1(x)        (sort_order_big5[(uchar) (x)])
+#define WEIGHT_MB2(x,y)      (big5code(x, y))
+#include "ctype-strcoll.ic"
+#undef WEIGHT_MB1
+#undef WEIGHT_MB2

alternatively you could undef them at the end of ctype-strcoll.ic
it'll also be a protection against including this file twice
(although it cannot be done now anyway)

Thanks for the idea. This can certainly be done for

Done. Now strcoll.ic ends with:

#undef WEIGHT_MB1
#undef WEIGHT_MB2

So the files that include strcoll.ic do not have to undefine.
Looks better.


Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and

Yes, I would've done that, I guess.



By the way, could you please rename sql/sql_plugin_services.h to
sql/sql_plugin_services.ic and sql/sys_vars.h to sql/sys_vars.ic?

As you're introducing a new convention, let's follow it consistently.

Agree. Note, I stole the idea from InnoDB, they use this *.ic extension widely.

I created a separate task to rename the other files:

MDEV-8384 Rename included source files to *.ic


+  The weight for automatically padded spaces when comparing strings with
+  the PAD SPACE property.
+  Should normally be equal to the weight of a regular space.

it is not a space for sjis and cp932? what does that mean?
padding space is not the same as a regular space? why?

It is the same.

- For sjis and cp932 wight for a single byte character is just
defined in a more complex way:

#define WEIGHT_MB1(x)        (256 * (int) (uchar) (x))
#define WEIGHT_PAD_SPACE     (256 * (int) ' ')

So regular and pad spaces have the same weight again: 0x2000

In that case it might've been easier to read (less questions asked) if
you'd define it unconditionally as

    #define WEIGHT_PAD_SPACE WEIGHT_MB1(' ')

In case of a _ci collation WEIGHT_MB1 uses a conversion table,
to provide case insensitivity:

#define WEIGHT_MB1(x)        (256 * (int) sort_order_sjis[(uchar) (x)])

So WEIGHT_PAD_SPACE defined through WEIGHT_MB1 would be:

    (256 * (int) sort_order_sjis[(uchar) ' '])

instead of just:

   (256 * (int) (uchar) (x))

So I decided to keep it in the way I've done,
and added more comments near the default definition in strcoll.ic

+  Weight of an illegal byte.
+  Must be greater than weight of any normal character.
+  Two bad bytes are compared binary.
+#define WEIGHT_ILSEQ(x)   (0xFF00 + (x))

why #ifndef? it creates an impression that the caller (includer :)
can redefine it. better define WEIGHT_ILSEQ here unconditionally
and undefine it at the end

You got a correct impression.

The next step is to implement Asian MB3 character sets, eucjpms and
ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will
redefine it, most likely like this:

#define WEIGHT_ILSEQ(x)   (0xFFFF00 + (x))

I don't particularly like that. Same as with WEIGHT_PAD_SPACE - one
thinks that WEIGHT_ILSEQ depends on the charset. While in fact it only
depends on the max_char_length.

I cannot immediately suggest a better definition though (constant and
easy to understand), so it's ok to keep it like you've done.


Running mtr last time now, then will push it.

Thanks for your suggestions!