maria-developers team mailing list archive

Thread
Date
Re: bf677c554ad: UTF8 and utf8

To: Rucha Deodhar <rucha.deodhar@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Mon, 8 Mar 2021 14:28:30 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <JWH0vrNqroG.A.y3E.QUiRgB@pweza>
Hi, Rucha!

On Mar 08, Rucha Deodhar wrote:
> revision-id: bf677c554ad (mariadb-10.5.2-402-gbf677c554ad)
> parent(s): a1542f8a573
> author: Rucha Deodhar <rucha.deodhar@xxxxxxxxxxx>
> committer: Rucha Deodhar <rucha.deodhar@xxxxxxxxxxx>
> timestamp: 2021-02-22 00:37:18 +0530
> message:
> 
> UTF8 and utf8

Two main comments:

* your global search-and-replace utf8 to utf8mb3 everywhere has replaced
  way too much. Please, revert it.

* why did you have so many changes in tests with utf8mb4? You set the
  old-mode to be utf8_is_utf8mb3 by default. It should've applied to
  tests too, right? So why tests have switched to utf8mb4?

Other comments below.

> diff --git a/debian/additions/innotop/innotop b/debian/additions/innotop/innotop
> index 19229a57a82..4f1e76ea497 100644
> --- a/debian/additions/innotop/innotop
> +++ b/debian/additions/innotop/innotop
> @@ -264,9 +264,9 @@ sub get_dbh {
>              MKDEBUG && _d("$dbh: $sql");
>              $dbh->do($sql);
>              MKDEBUG && _d('Enabling charset for STDOUT');
> -            if ( $charset eq 'utf8' ) {
> -               binmode(STDOUT, ':utf8')
> -                  or die "Can't binmode(STDOUT, ':utf8'): $OS_ERROR";
> +            if ( $charset eq 'utf8mb3' ) {
> +               binmode(STDOUT, ':utf8mb3')
> +                  or die "Can't binmode(STDOUT, ':utf8mb3'): $OS_ERROR";

It seems you performed search-and-replace over the whole source tree.
I don't think it was particularly meaningful.

Here, for example, it was wrong, it's perl code and perl charser utf8,
there's no utf8mb3 in perl.

also all internal variables you've renamed, like below:

>              }
>              else {
>                 binmode(STDOUT) or die "Can't binmode(STDOUT): $OS_ERROR";
> diff --git a/sql/field.h b/sql/field.h
> index 4a4f7cee2a5..f672ae7e315 100644
> --- a/sql/field.h
> +++ b/sql/field.h
> @@ -582,7 +582,7 @@ class Virtual_column_info: public Sql_alloc,
>  public:
>    /* Flag indicating  that the field is physically stored in the database */
>    bool stored_in_db;
> -  bool utf8;                                    /* Already in utf8 */
> +  bool utf8mb3;                                    /* Already in utf8 */

I mean, here ^^^

>    bool automatic_name;
>    Item *expr;
>    Lex_ident name;                               /* Name of constraint */
> diff --git a/client/mysql.cc b/client/mysql.cc
> index f2938d4c824..b0ee239c0bc 100644
> --- a/client/mysql.cc
> +++ b/client/mysql.cc
> @@ -1918,6 +1918,10 @@ static int get_options(int argc, char **argv)
>  
>    if ((ho_error=handle_options(&argc, &argv, my_long_options, get_one_option)))
>      return(ho_error);
> +  if (!strcasecmp("utf8",default_charset))
> +  {
> +    default_charset=(char*)"utf8mb3";
> +  }

why is that? (and the same everywhere else)

>  
>    *mysql_params->p_max_allowed_packet= opt_max_allowed_packet;
>    *mysql_params->p_net_buffer_length= opt_net_buffer_length;
> diff --git a/client/mysqldump.c b/client/mysqldump.c
> index 7c363973da2..ad69618fbd1 100644
> --- a/client/mysqldump.c
> +++ b/client/mysqldump.c
> @@ -3160,7 +3160,7 @@ static uint get_table_structure(const char *table, const char *db, char *table_t
>  
>            fprintf(sql_file,
>                    "SET @saved_cs_client     = @@character_set_client;\n"
> -                  "SET character_set_client = utf8;\n"
> +                  "SET character_set_client = utf8mb3;\n"

why is that?

>                    "/*!50001 CREATE TABLE %s (\n",
>                    result_table);
>  
> diff --git a/include/m_ctype.h b/include/m_ctype.h
> index 5fa8f28ff7a..f1a8166d9df 100644
> --- a/include/m_ctype.h
> +++ b/include/m_ctype.h
> @@ -223,6 +223,7 @@ extern MY_UNI_CTYPE my_uni_ctype[256];
>  #define MY_CS_NON1TO1 0x40000  /* Has a complex mapping from characters
>                                    to weights, e.g. contractions, expansions,
>                                    ignorable characters */
> +#define MY_CS_UTF8_IS_UTF8MB3 0x80000

this isn't a MY_CS_xxx flag, it doesn't belong here,
see all other MY_CS_xxx flags - they are *properties* of a charset.
They are set as  

struct charset_info_st my_charset_tis620_thai_ci=
{
    18,0,0,		/* number    */
    MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_NON1TO1, /* state     */
    charset_name_tis620,		/* cs name    */
    "tis620_thai_ci",	/* name      */

that is, when defining a charset.
See below

>  #define MY_CHARSET_UNDEFINED 0
>  
>  /* Character repertoire flags */
> diff --git a/man/mysql.1 b/man/mysql.1
> index 03f23df3660..5ad953ba5eb 100644
> --- a/man/mysql.1
> +++ b/man/mysql.1
> @@ -401,7 +401,7 @@ Use
>  as the default character set for the client and connection\&.
>  .sp
>  A common issue that can occur when the operating system uses
> -utf8
> +utf8mb3

this didn't need changing

>  or another multi\-byte character set is that output from the
>  \fBmysql\fR
>  client is formatted incorrectly, due to the fact that the MariaDB client uses the
> diff --git a/mysys/charset.c b/mysys/charset.c
> index 32cfeb56e2d..9d035406e8d 100644
> --- a/mysys/charset.c
> +++ b/mysys/charset.c
> @@ -361,7 +361,7 @@ static int add_collation(struct charset_info_st *cs)
>          newcs->state|= MY_CS_AVAILABLE | MY_CS_LOADED | MY_CS_NONASCII;
>  #endif        
>        }
> -      else if (!strcmp(cs->csname, "utf8") || !strcmp(cs->csname, "utf8mb3"))
> +      else if (!strcmp(cs->csname, "utf8mb3") || !strcmp(cs->csname, "utf8mb3"))

one more result of your mechanical search-and-replace,
now you compare with "utf8mb3" twice.

>        {
>  #if defined (HAVE_CHARSET_utf8mb3) && defined(HAVE_UCA_COLLATIONS)
>          copy_uca_collation(newcs, newcs->state & MY_CS_NOPAD ?
> @@ -767,7 +767,7 @@ static const char*
>  get_charset_name_alias(const char *name)
>  {
>    if (!my_strcasecmp(&my_charset_latin1, name, "utf8mb3"))
> -    return "utf8";
> +    return "utf8mb3";

and again

>    return NULL;
>  }
>  
> @@ -1004,13 +1004,19 @@ CHARSET_INFO *
>  get_charset_by_csname(const char *cs_name, uint cs_flags, myf flags)
>  {
>    MY_CHARSET_LOADER loader;
> +
> +  if (!strcasecmp(cs_name, "utf8"))
> +  {
> +    cs_name = (const char*)(cs_flags & MY_CS_UTF8_IS_UTF8MB3 ? "utf8mb3" : "utf8mb4");

your flag should be in `myf flags` not in `uint cs_flags`
(and defined in my_sys.h with other myf flags)

> +  }
> +
>    my_charset_loader_init_mysys(&loader);
>    return my_charset_get_by_name(&loader, cs_name, cs_flags, flags);
>  }
>  
>  
>  /**
> -  Resolve character set by the character set name (utf8, latin1, ...).
> +  Resolve character set by the character set name (utf8mb3, latin1, ...).
>  
>    The function tries to resolve character set by the specified name. If
>    there is character set with the given name, it is assigned to the "cs"
> @@ -1453,8 +1459,8 @@ static const MY_CSET_OS_NAME charsets[] =
>  
>    {"US-ASCII",       "latin1",   my_cs_approx},
>  
> -  {"utf8",           "utf8",     my_cs_exact},
> -  {"utf-8",          "utf8",     my_cs_exact},
> +  {"utf8mb3",           "utf8mb3",     my_cs_exact},
> +  {"utf-8mb3",          "utf8mb3",     my_cs_exact},

utf-8mb3 really? :)

>  #endif
>    {NULL,             NULL,       0}
>  };
> diff --git a/plugin/win_auth_client/common.cc b/plugin/win_auth_client/common.cc
> index 8b7319252ac..f46b0ec9696 100644
> --- a/plugin/win_auth_client/common.cc
> +++ b/plugin/win_auth_client/common.cc
> @@ -338,7 +338,7 @@ UPN::UPN(): m_buf(NULL)
>    m_buf= wchar_to_utf8(buf1, &m_len);
>  
>    if(!m_buf)
> -    ERROR_LOG(ERROR, ("Failed to convert UPN to utf8"));
> +    ERROR_LOG(ERROR, ("Failed to convert UPN to utf8mb3"));

probably wrong too (everything in this file), see the code - it uses
windows utf8 charset, not mariadb's utf8

>  
>    // Note: possible error would be indicated by the fact that m_buf is NULL.
>    return;
> diff --git a/sql/mysqld.cc b/sql/mysqld.cc
> index 0bf21e02002..fd131497f8c 100644
> --- a/sql/mysqld.cc
> +++ b/sql/mysqld.cc
> @@ -4053,6 +4056,24 @@ static int init_common_variables()
>        break;
>    }
>  
> +    if (default_collation_name)
> +    {
> +    char *copy_of_name= (char*)default_collation_name;
> +    char start[6];
> +    strncpy(start, default_collation_name, 5);
> +    char *fname= (char *)"utf8mb3_";      
> +    if (! strncasecmp("utf8_", start,5))
> +    {
> +     copy_of_name+= 5;
> +     char result[64];
> +     result[63]='\0';
> +     strcpy(result, fname);
> +     strcat(result, copy_of_name);
> +     result[strlen(copy_of_name)+strlen(fname)]='\0';
> +     default_collation_name= (char *) result;
> +    }
> +    }

1. wrong indentation
2. shouldn't it be in the get_charset_by_name ?
3. shouldn't it depend on the UTF8_IS_UTF8MB3 ?

> +  
>    if (default_collation_name)
>    {
>      CHARSET_INFO *default_collation;
> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index 50b746fe514..f0616b40361 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -5248,6 +5249,65 @@ class THD: public THD_count, /* this must be first */
>    Item *sp_prepare_func_item(Item **it_addr, uint cols= 1);
>    bool sp_eval_expr(Field *result_field, Item **expr_item_ptr);
>  
> +  CHARSET_INFO *get_charset_by_csname(const char *cs_name,
> +                                      uint cs_flags,
> +                                      myf myf_flags) const
> +  {
> +    return ::get_charset_by_csname(cs_name,
> +                                   this->variables.old_behavior & OLD_MODE_UTF8_IS_UTF8MB3 ?
> +                                   cs_flags | MY_CS_UTF8_IS_UTF8MB3 : cs_flags, 
> +                                   myf_flags);
> +  }
> +  
> +  inline const char* get_collation_or_charset_name(const char* name)

shouldn't the name be get_collation_name_alias() ?

all other method names you've created match corresponding
charset-related function names

> +  {
> +     char *copy_of_name= (char*)name;
> +     char start[6];
> +     strncpy(start, name, 5);
> +     char *fname= (char *)(this->variables.old_behavior & OLD_MODE_UTF8_IS_UTF8MB3?"utf8mb3_":"utf8mb4_");
> +      if (!strncasecmp("utf8_", start,5))
> +      {
> +        copy_of_name+= 5;
> +        char result[64];
> +        result[63]='\0';
> +        strcpy(result, fname);
> +        strcat(result, copy_of_name);
> +        result[strlen(copy_of_name)+strlen(fname)]='\0';
> +        name= (const char *) result;
> +      }
> +      return name;

here you return a pointer to the local variable

also: indentation, too long line, end-of-line spaces

> +  }
> +  
> +  inline CHARSET_INFO *
> +  mysqld_collation_get_by_name(const char *name,
> +                             CHARSET_INFO *name_cs= system_charset_info)
> +  {
> +    name=this->get_collation_or_charset_name(name);
> +    return ::mysqld_collation_get_by_name(name, name_cs);
> +}
> +
> +CHARSET_INFO *get_charset_by_name(const char *cs_name, myf flags)
> +{
> +
> +  cs_name=this->get_collation_or_charset_name(cs_name);
> +  return ::get_charset_by_name(cs_name,flags);

here it'd be better to pass MY_UTF8_IS_UTF8MB3 in flags

> +}
> +my_bool resolve_charset(const char *cs_name,
> +                        CHARSET_INFO *default_cs,
> +                        CHARSET_INFO **cs)
> +{
> +  cs_name=this->get_collation_or_charset_name(cs_name);
> +  return ::resolve_charset(cs_name,default_cs,cs);

perhaps it'd make sense to introduce flags argument here too?

> +}
> +
> +my_bool resolve_collation(const char *cl_name,
> +                          CHARSET_INFO *default_cl,
> +                          CHARSET_INFO **cl)
> +{
> +   cl_name= this->get_collation_or_charset_name(cl_name);
> +   return ::resolve_collation(cl_name,default_cl,cl);
> +}

please make sure the indentation is correct, I won't comment on it anymore

> +
>  };
>  
>  
> diff --git a/sql/sql_db.cc b/sql/sql_db.cc
> index 9bf16220535..85e0e0dbd2f 100644
> --- a/sql/sql_db.cc
> +++ b/sql/sql_db.cc
> @@ -583,9 +583,13 @@ bool load_db_opt(THD *thd, const char *path, Schema_specification_st *create)
>             default-collation commands.
>          */
>          if (!(create->default_table_charset=
> -        get_charset_by_csname(pos+1, MY_CS_PRIMARY, MYF(0))) &&
> +        thd->get_charset_by_csname(pos+1, thd->variables.old_behavior &
> +                                          OLD_MODE_UTF8_IS_UTF8MB3 ? 
> +                                          MY_CS_UTF8_IS_UTF8MB3 | MY_CS_PRIMARY :
> +                                          MY_CS_PRIMARY,
> +                                          MYF(0))) &&

that's a bit redundant, don't you think?
you use both thd-> method that replaces utf_ with a real charset name
and *also* you set MY_CS_UTF8_IS_UTF8MB3 flag.

there are other places like that, to keep the review shorter I won't comment
on them below.

>              !(create->default_table_charset=
> -              get_charset_by_name(pos+1, MYF(0))))
> +              thd->get_charset_by_name(pos+1, MYF(0))))
>          {
>            sql_print_error("Error while loading database options: '%s':",path);
>            sql_print_error(ER_THD(thd, ER_UNKNOWN_CHARACTER_SET),pos+1);
> diff --git a/sql/sys_vars.cc b/sql/sys_vars.cc
> index a75066271d9..fd480bdced2 100644
> --- a/sql/sys_vars.cc
> +++ b/sql/sys_vars.cc
> @@ -3705,7 +3709,7 @@ static Sys_var_set Sys_old_behavior(
>         "old_mode",
>         "Used to emulate old behavior from earlier MariaDB or MySQL versions",
>         SESSION_VAR(old_behavior), CMD_LINE(REQUIRED_ARG),
> -       old_mode_names, DEFAULT(0));
> +       old_mode_names, DEFAULT(8));

DEFAULT(OLD_MODE_UTF8_IS_UTF8MB3) please

>  
>  #if defined(HAVE_OPENSSL) && !defined(EMBEDDED_LIBRARY)
>  #define SSL_OPT(X) CMD_LINE(REQUIRED_ARG,X)
> diff --git a/storage/connect/ha_connect.cc b/storage/connect/ha_connect.cc
> index 849cf7594eb..b6e8f3b3226 100644
> --- a/storage/connect/ha_connect.cc
> +++ b/storage/connect/ha_connect.cc
> @@ -6099,8 +6099,8 @@ static int connect_assisted_discovery(handlerton *, THD* thd,
>  							case FLD_NAME:
>  								if (ttp == TAB_PRX ||
>  									(ttp == TAB_CSV && topt->data_charset &&
> -									(!stricmp(topt->data_charset, "UTF8") ||
> -										!stricmp(topt->data_charset, "UTF-8"))))
> +									(!stricmp(topt->data_charset, "UTF8MB3") ||
> +										!stricmp(topt->data_charset, "UTF-8MB3"))))

UTF-8MB3 again

>  									cnm= crp->Kdata->GetCharValue(i);
>  								else
>  									cnm= encode(g, crp->Kdata->GetCharValue(i));
> diff --git a/storage/mroonga/vendor/groonga/CMakeLists.txt b/storage/mroonga/vendor/groonga/CMakeLists.txt
> index d271d4c4eb9..fa5d6f97ddd 100644
> --- a/storage/mroonga/vendor/groonga/CMakeLists.txt
> +++ b/storage/mroonga/vendor/groonga/CMakeLists.txt
> @@ -95,7 +95,7 @@ set(GRN_LOG_PATH
>    "${CMAKE_INSTALL_PREFIX}/var/log/${GRN_PROJECT_NAME}/${GRN_PROJECT_NAME}.log"
>    CACHE FILEPATH "log file path")
>  set(GRN_DEFAULT_ENCODING
> -  "utf8"
> +  "utf8mb3"

utf8 is better here

>    CACHE STRING "Groonga's default encoding")
>  set(GRN_DEFAULT_MATCH_ESCALATION_THRESHOLD
>    0
> diff --git a/storage/perfschema/unittest/pfs_connect_attr-t.cc b/storage/perfschema/unittest/pfs_connect_attr-t.cc
> index b57ead3ec26..7e5b511d543 100644
> --- a/storage/perfschema/unittest/pfs_connect_attr-t.cc
> +++ b/storage/perfschema/unittest/pfs_connect_attr-t.cc
> @@ -343,7 +343,12 @@ int main(int, char **)
>  {
>    MY_INIT("pfs_connect_attr-t");
>  
> -  cs_cp1251= get_charset_by_csname("cp1251", MY_CS_PRIMARY, MYF(0));
> +  cs_cp1251= thd->get_charset_by_csname("cp1251",
> +                                         thd->variables.old_behavior &
> +                                         OLD_MODE_UTF8_IS_UTF8M3 ?
> +                                         MY_CS_PRIMARY | MY_CS_UTF8_IS_UTF8MB3 :

well, here there is surely no reason to use MY_CS_UTF8_IS_UTF8MB3,
because the charset is a hard-coded literal and it's not "utf8", is it?

> +                                         MY_CS_PRIMARY,
> +                                         MYF(0));
>    if (!cs_cp1251)
>      diag("skipping the cp1251 tests : missing character set");
>    plan(59 + (cs_cp1251 ? 10 : 0));
> diff --git a/strings/ctype.c b/strings/ctype.c
> index 46951c3ae1f..9054cbfc3e6 100644
> --- a/strings/ctype.c
> +++ b/strings/ctype.c
> @@ -37,7 +37,7 @@
>  */
>  
>  const char charset_name_latin2[]= "latin2";
> -const char charset_name_utf8[]= "utf8";
> +const char charset_name_utf8[]= "utf8mb3";

better rename the variable to charset_name_utf8mb3

>  const char charset_name_utf16[]= "utf16";
>  const char charset_name_utf32[]= "utf32";
>  const char charset_name_ucs2[]= "ucs2";
> diff --git a/tests/mysql_client_test.c b/tests/mysql_client_test.c
> index 0043786d477..37da9e1f62e 100644
> --- a/tests/mysql_client_test.c
> +++ b/tests/mysql_client_test.c
> @@ -405,7 +405,7 @@ static void test_prepare_simple()
>  
>    /* update */
>    strmov(query, "UPDATE test_prepare_simple SET id=? "
> -                "WHERE id=? AND CONVERT(name USING utf8)= ?");
> +                "WHERE id=? AND CONVERT(name USING utf8mb3)= ?");

I suspect all tests should better use "utf8"

>    stmt= mysql_simple_prepare(mysql, query);
>    check_stmt(stmt);
>  
 
Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx