maria-developers team mailing list archive

Thread
Date

Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"

To: Alexander Barkov <bar@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Wed, 24 Jun 2015 18:46:30 +0200
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <55894C83.1030100@mariadb.org>
User-agent: Mutt/1.5.23 (2014-03-12)

Hi, Alexander!

On Jun 23, Alexander Barkov wrote:
> >
> >> +#define MY_FUNCTION_NAME(x)   my_ ## x ## _big5_chinese_ci
> >> +#define WEIGHT_MB1(x)        (sort_order_big5[(uchar) (x)])
> >> +#define WEIGHT_MB2(x,y)      (big5code(x, y))
> >> +#include "ctype-strcoll.ic"
> >> +
> >> +
> >> +#undef MY_FUNCTION_NAME
> >> +#undef WEIGHT_MB1
> >> +#undef WEIGHT_MB2
> >
> > alternatively you could undef them at the end of ctype-strcoll.ic
> > it'll also be a protection against including this file twice
> > (although it cannot be done now anyway)
> 
> Thanks for the idea. This can certainly be done for
> WEIGHT_MB1 and WEIGHT_MB2.
> 
> What do you suggest to do with MY_FUNCTION_NAME?
> It's now used in ctype-strcoll.ic in collation name context,
> and in ctype-mb.ic in character name context.
> 
> So the structure of a file is about like this:
> 
> #define MY_FUNCTION_NAME(x)   my_ ## x ## _sjis
> #define ... character set specific definitions
> #include "ctype-mb.ic"
> ...
> #undef MY_FUNCTION_NAME
> #define MY_FUNCTION_NAME(x)   my_ ## x ## _sjis_japanese_ci
> #define ... ci collation specific definitions
> #include "ctype-strcoll.ic"
> ...
> #undef MY_FUNCTION_NAME
> #define MY_FUNCTION_NAME(x)   my_ ## x ## _sjis_bin
> #define ... bin collation specific definitions
> #include "ctype-strcoll.ic"
> ...
> 
> Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and 
> ctype-strncoll.ic

Yes, I would've done that, I guess.

> Another approach would be to have two things:
> 
> MY_CHARSET_FUNCTION_NAME
> MY_COLLATION_FUNCTION_NAME
> 
> What will be easier to read?

I don't think so. Unless you need both MY_CHARSET_FUNCTION_NAME and
MY_COLLATION_FUNCTION_NAME in one include file. But you apparently
don't.

By the way, could you please rename sql/sql_plugin_services.h to
sql/sql_plugin_services.ic and sql/sys_vars.h to sql/sys_vars.ic?

As you're introducing a new convention, let's follow it consistently.

> >> +#define MY_FUNCTION_NAME(x)   my_ ## x ## _big5_bin
> >> +#define WEIGHT_MB1(x)        ((uchar) (x))
> >> +#define WEIGHT_MB2(x,y)      (big5code(x, y))
> >> +#include "ctype-strcoll.ic"
> >> +
> >> diff --git a/strings/ctype-strcoll.ic b/strings/ctype-strcoll.ic
> >> new file mode 100644
> >> index 0000000..7217f99
> >> --- /dev/null
> >> +++ b/strings/ctype-strcoll.ic
> >> @@ -0,0 +1,208 @@
> >> +/*
> >> +   Copyright (c) 2015, MariaDB Foundation
> >> +
> >> +   This program is free software; you can redistribute it and/or modify
> >> +   it under the terms of the GNU General Public License as published by
> >> +   the Free Software Foundation; version 2 of the License.
> >> +
> >> +   This program is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> +   GNU General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU General Public License
> >> +   along with this program; if not, write to the Free Software
> >> +   Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> >> +*/
> >> +
> >> +
> >> +#ifndef MY_FUNCTION_NAME
> >> +#error MY_FUNCTION_NAME is not defined
> >> +#endif
> >> +
> >> +
> >> +/*
> >> +  The weight for automatically padded spaces when comparing strings with
> >> +  the PAD SPACE property.
> >> +  Should normally be equal to the weight of a regular space.
> >
> > it is not a space for sjis and cp932? what does that mean?
> > padding space is not the same as a regular space? why?
> 
> It is the same.
> 
> - For sjis and cp932 wight for a single byte character is just
> defined in a more complex way:
> 
> #define WEIGHT_MB1(x)        (256 * (int) (uchar) (x))
> #define WEIGHT_PAD_SPACE     (256 * (int) ' ')
> 
> So regular and pad spaces have the same weight again: 0x2000

In that case it might've been easier to read (less questions asked) if
you'd define it unconditionally as

   #define WEIGHT_PAD_SPACE WEIGHT_MB1(' ')

> >> +/*
> >> +  Weight of an illegal byte.
> >> +  Must be greater than weight of any normal character.
> >> +  Two bad bytes are compared binary.
> >> +*/
> >> +#ifndef WEIGHT_ILSEQ
> >> +#define WEIGHT_ILSEQ(x)   (0xFF00 + (x))
> >
> > why #ifndef? it creates an impression that the caller (includer :)
> > can redefine it. better define WEIGHT_ILSEQ here unconditionally
> > and undefine it at the end
> 
> You got a correct impression.
> 
> The next step is to implement Asian MB3 character sets, eucjpms and
> ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will 
> redefine it, most likely like this:
> 
> #define WEIGHT_ILSEQ(x)   (0xFFFF00 + (x))

I don't particularly like that. Same as with WEIGHT_PAD_SPACE - one
thinks that WEIGHT_ILSEQ depends on the charset. While in fact it only
depends on the max_char_length.

I cannot immediately suggest a better definition though (constant and
easy to understand), so it's ok to keep it like you've done.

Regards,
Sergei

Follow ups

Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Alexander Barkov, 2015-06-26

References

MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Alexander Barkov, 2015-05-24
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Sergei Golubchik, 2015-06-18
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
From: Alexander Barkov, 2015-06-23