maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #08756
Re: MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
Hi, Alexander!
On Jun 23, Alexander Barkov wrote:
> >
> >> +#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_chinese_ci
> >> +#define WEIGHT_MB1(x) (sort_order_big5[(uchar) (x)])
> >> +#define WEIGHT_MB2(x,y) (big5code(x, y))
> >> +#include "ctype-strcoll.ic"
> >> +
> >> +
> >> +#undef MY_FUNCTION_NAME
> >> +#undef WEIGHT_MB1
> >> +#undef WEIGHT_MB2
> >
> > alternatively you could undef them at the end of ctype-strcoll.ic
> > it'll also be a protection against including this file twice
> > (although it cannot be done now anyway)
>
> Thanks for the idea. This can certainly be done for
> WEIGHT_MB1 and WEIGHT_MB2.
>
> What do you suggest to do with MY_FUNCTION_NAME?
> It's now used in ctype-strcoll.ic in collation name context,
> and in ctype-mb.ic in character name context.
>
> So the structure of a file is about like this:
>
> #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis
> #define ... character set specific definitions
> #include "ctype-mb.ic"
> ...
> #undef MY_FUNCTION_NAME
> #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis_japanese_ci
> #define ... ci collation specific definitions
> #include "ctype-strcoll.ic"
> ...
> #undef MY_FUNCTION_NAME
> #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis_bin
> #define ... bin collation specific definitions
> #include "ctype-strcoll.ic"
> ...
>
> Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and
> ctype-strncoll.ic
Yes, I would've done that, I guess.
> Another approach would be to have two things:
>
> MY_CHARSET_FUNCTION_NAME
> MY_COLLATION_FUNCTION_NAME
>
> What will be easier to read?
I don't think so. Unless you need both MY_CHARSET_FUNCTION_NAME and
MY_COLLATION_FUNCTION_NAME in one include file. But you apparently
don't.
By the way, could you please rename sql/sql_plugin_services.h to
sql/sql_plugin_services.ic and sql/sys_vars.h to sql/sys_vars.ic?
As you're introducing a new convention, let's follow it consistently.
> >> +#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_bin
> >> +#define WEIGHT_MB1(x) ((uchar) (x))
> >> +#define WEIGHT_MB2(x,y) (big5code(x, y))
> >> +#include "ctype-strcoll.ic"
> >> +
> >> diff --git a/strings/ctype-strcoll.ic b/strings/ctype-strcoll.ic
> >> new file mode 100644
> >> index 0000000..7217f99
> >> --- /dev/null
> >> +++ b/strings/ctype-strcoll.ic
> >> @@ -0,0 +1,208 @@
> >> +/*
> >> + Copyright (c) 2015, MariaDB Foundation
> >> +
> >> + This program is free software; you can redistribute it and/or modify
> >> + it under the terms of the GNU General Public License as published by
> >> + the Free Software Foundation; version 2 of the License.
> >> +
> >> + This program is distributed in the hope that it will be useful,
> >> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> >> + GNU General Public License for more details.
> >> +
> >> + You should have received a copy of the GNU General Public License
> >> + along with this program; if not, write to the Free Software
> >> + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> >> +*/
> >> +
> >> +
> >> +#ifndef MY_FUNCTION_NAME
> >> +#error MY_FUNCTION_NAME is not defined
> >> +#endif
> >> +
> >> +
> >> +/*
> >> + The weight for automatically padded spaces when comparing strings with
> >> + the PAD SPACE property.
> >> + Should normally be equal to the weight of a regular space.
> >
> > it is not a space for sjis and cp932? what does that mean?
> > padding space is not the same as a regular space? why?
>
> It is the same.
>
> - For sjis and cp932 wight for a single byte character is just
> defined in a more complex way:
>
> #define WEIGHT_MB1(x) (256 * (int) (uchar) (x))
> #define WEIGHT_PAD_SPACE (256 * (int) ' ')
>
> So regular and pad spaces have the same weight again: 0x2000
In that case it might've been easier to read (less questions asked) if
you'd define it unconditionally as
#define WEIGHT_PAD_SPACE WEIGHT_MB1(' ')
> >> +/*
> >> + Weight of an illegal byte.
> >> + Must be greater than weight of any normal character.
> >> + Two bad bytes are compared binary.
> >> +*/
> >> +#ifndef WEIGHT_ILSEQ
> >> +#define WEIGHT_ILSEQ(x) (0xFF00 + (x))
> >
> > why #ifndef? it creates an impression that the caller (includer :)
> > can redefine it. better define WEIGHT_ILSEQ here unconditionally
> > and undefine it at the end
>
> You got a correct impression.
>
> The next step is to implement Asian MB3 character sets, eucjpms and
> ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will
> redefine it, most likely like this:
>
> #define WEIGHT_ILSEQ(x) (0xFFFF00 + (x))
I don't particularly like that. Same as with WEIGHT_PAD_SPACE - one
thinks that WEIGHT_ILSEQ depends on the charset. While in fact it only
depends on the max_char_length.
I cannot immediately suggest a better definition though (constant and
easy to understand), so it's ok to keep it like you've done.
Regards,
Sergei
Follow ups
References