Re: 49ecf935415: MDEV-27009 Add UCA-14.0.0 collations

 

Hello Sergei,

Thanks for the review.


Please review the new set of UCA-14.0.0 patches here:

https://github.com/MariaDB/server/tree/bb-10.9-bar-uca14


Please see comments below:


On 3/16/22 10:19 PM, Sergei Golubchik wrote:
Hi, Alexander,

On Mar 14, Alexander Barkov wrote:
revision-id: 49ecf935415 (mariadb-10.6.1-335-g49ecf935415)
parent(s): c67789f63c8
author: Alexander Barkov
committer: Alexander Barkov
timestamp: 2022-02-28 14:04:58 +0400
message:

MDEV-27009 Add UCA-14.0.0 collations

please, list all user visible changes there. Mainly that
collations are now decoupled from charsets. New syntax in CREATE
TABLE, changes in I_S tables, etc.

Added.

By the way, perhaps some of these statements should display
short collation names:

  SHOW CREATE TABLE t1;
  SHOW CREATE DATABASE db1;
  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
  SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
  SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;

Can we discuss this?

diff --git a/mysql-test/include/ctype_utf_uca1400_ids.inc b/mysql-test/include/ctype_utf_uca1400_ids.inc
new file mode 100644
index 00000000000..09cf49fc0e7
--- /dev/null
+++ b/mysql-test/include/ctype_utf_uca1400_ids.inc
@@ -0,0 +1,17 @@

file names are confusing. better rename ctype_ucs_uca1400_ids.inc
to something like ctype_convert_uca1400_ids
and ctype_utf_uca1400_ids to ctype_set_names_uca1400_ids
or something like that, to show what they do.

Renamed to

ctype_uca1400_ids_using_convert.inc
ctype_uca1400_ids_using_set_names.inc


+
+--disable_ps_protocol
+--enable_metadata
+DELIMITER $$;
+FOR rec IN (SELECT COLLATION_NAME
+            FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
+            WHERE CHARACTER_SET_NAME=@charset
+              AND COLLATION_NAME RLIKE 'uca1400'
+            ORDER BY ID)
+DO
+  EXECUTE IMMEDIATE CONCAT('SET NAMES ',@charset,' COLLATE ', rec.COLLATION_NAME);
+  SELECT rec.COLLATION_NAME;
+END FOR;
+$$
+DELIMITER ;$$
+--disable_metadata
+--enable_ps_protocol
diff --git a/include/m_ctype.h b/include/m_ctype.h
index 4c6628b72b3..706764ead2a 100644
--- a/include/m_ctype.h
+++ b/include/m_ctype.h
@@ -34,7 +34,9 @@ enum loglevel {
  extern "C" {
  #endif
-#define MY_CS_NAME_SIZE 32
+#define MY_CS_CHARACTER_SET_NAME_SIZE   32
+#define MY_CS_COLLATION_NAME_SIZE       64

That's FULL_COLLATION_NAME_SIZE, right?

I think we can have just one at this point,
which fits any collation name (full and short).



+
  #define MY_CS_CTYPE_TABLE_SIZE		257
  #define MY_CS_TO_LOWER_TABLE_SIZE	256
  #define MY_CS_TO_UPPER_TABLE_SIZE	256
@@ -240,6 +242,46 @@ typedef enum enum_repertoire_t
  } my_repertoire_t;
+/* ID compatibility */
+typedef enum enum_collation_id_type
+{
+  MY_COLLATION_ID_TYPE_PRECISE=          0,
+  MY_COLLATION_ID_TYPE_COMPAT_100800=    1
+} my_collation_id_type_t;
+
+
+/* Collation name display modes */
+typedef enum enum_collation_name_mode
+{
+  MY_COLLATION_NAME_MODE_FULL=                                 0,
+  MY_COLLATION_NAME_MODE_CONTEXT=                              1
+} my_collation_name_mode_t;
+
+
+/* Level flags */
+#define MY_CS_LEVEL_BIT_PRIMARY    0x00
+#define MY_CS_LEVEL_BIT_SECONDARY  0x01
+#define MY_CS_LEVEL_BIT_TERTIARY   0x02
+#define MY_CS_LEVEL_BIT_QUATERNARY 0x03
+
+#define MY_CS_COLL_LEVELS_S1       (1<<MY_CS_LEVEL_BIT_PRIMARY)
+
+#define MY_CS_COLL_LEVELS_AI_CS    (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
+                                   (1<<MY_CS_LEVEL_BIT_TERTIARY)
+
+#define MY_CS_COLL_LEVELS_S2       (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
+                                   (1<<MY_CS_LEVEL_BIT_SECONDARY)
+
+#define MY_CS_COLL_LEVELS_S3       (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
+                                   (1<<MY_CS_LEVEL_BIT_SECONDARY) | \
+                                   (1<<MY_CS_LEVEL_BIT_TERTIARY)

AI_CS and S3 don't seem to be used yet

Right, there are no old _AI_CS and _AS_CS (aka S3) collations.


New _AI_CS and _AS_CS collation definitions
are initialized by this function:

my_uca1400_collation_definition_init(MY_CHARSET_LOADER *loader,
                                     struct charset_info_st *dst,
                                     uint id)

Level flags are calculated by this function from "id".
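
Just for illustration, here is how those masks relate to the accent/case
sensitivity suffixes of the new collation names (a standalone sketch; the
helper name and its boolean parameters are made up, this is not the actual
my_uca1400_collation_definition_init() code):

/*
  Illustration only: per UCA, level 2 carries accent weights and
  level 3 carries case weights, so the suffixes map to the masks
  defined in the patch above as follows. Not server code.
*/
#include <stdio.h>

#define MY_CS_LEVEL_BIT_PRIMARY    0x00  /* copied from the patch above */
#define MY_CS_LEVEL_BIT_SECONDARY  0x01
#define MY_CS_LEVEL_BIT_TERTIARY   0x02

static unsigned uca1400_level_flags(int accent_sensitive, int case_sensitive)
{
  unsigned flags= 1u << MY_CS_LEVEL_BIT_PRIMARY;     /* always present */
  if (accent_sensitive)
    flags|= 1u << MY_CS_LEVEL_BIT_SECONDARY;         /* _as_ */
  if (case_sensitive)
    flags|= 1u << MY_CS_LEVEL_BIT_TERTIARY;          /* _cs  */
  return flags;
}

int main(void)
{
  printf("_ai_ci -> 0x%02X (MY_CS_COLL_LEVELS_S1)\n",    uca1400_level_flags(0, 0));
  printf("_ai_cs -> 0x%02X (MY_CS_COLL_LEVELS_AI_CS)\n", uca1400_level_flags(0, 1));
  printf("_as_ci -> 0x%02X (MY_CS_COLL_LEVELS_S2)\n",    uca1400_level_flags(1, 0));
  printf("_as_cs -> 0x%02X (MY_CS_COLL_LEVELS_S3)\n",    uca1400_level_flags(1, 1));
  return 0;
}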


So there are no hard-coded definitions with
MY_CS_COLL_LEVELS_AI_CS and MY_CS_COLL_LEVELS_S3 either.


Should I remove these definitions?



+
+#define MY_CS_COLL_LEVELS_S4       (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
+                                   (1<<MY_CS_LEVEL_BIT_SECONDARY) | \
+                                   (1<<MY_CS_LEVEL_BIT_TERTIARY)  | \
+                                   (1<<MY_CS_LEVEL_BIT_QUATERNARY)
+
+
  /* Flags for strxfrm */
  #define MY_STRXFRM_LEVEL1          0x00000001 /* for primary weights   */
  #define MY_STRXFRM_LEVEL2          0x00000002 /* for secondary weights */
diff --git a/sql/sql_alter.cc b/sql/sql_alter.cc
index 86c6e9a27f8..9ddd482ad57 100644
--- a/sql/sql_alter.cc
+++ b/sql/sql_alter.cc
@@ -546,6 +546,7 @@ bool Sql_cmd_alter_table::execute(THD *thd)
result= mysql_alter_table(thd, &select_lex->db, &lex->name,
                              &create_info,
+                            lex->create_info.default_charset_collation,

I don't see why you need a new argument here. It's
create_info.default_charset_collation, so, mysql_alter_table already gets
it in create_info. All other mysql_alter_table invocations also
take create_info argument and can get default_charset_collation from there

I extracted this part and pushed it separately under the terms of this bug fix:

commit 208addf48444c0a36a2cc16cd2558ae694e905d5
Author: Alexander Barkov <bar@xxxxxxxxxxx>
Date:   Tue May 17 12:52:23 2022 +0400

Main patch MDEV-27896 Wrong result upon `COLLATE latin1_bin CHARACTER SET latin1` on the table or the database level


As you suggested, I did not add the new parameter;
I changed the data type of "create_info" instead:


 bool mysql_alter_table(THD *thd, const LEX_CSTRING *new_db,
                        const LEX_CSTRING *new_name,
-                       HA_CREATE_INFO *create_info,
+                       Table_specification_st *create_info,




                              first_table,
                              &alter_info,
                              select_lex->order_list.elements,
diff --git a/sql/sql_partition_admin.cc b/sql/sql_partition_admin.cc
index fb1ae0d5fc7..4188dde252b 100644
--- a/sql/sql_partition_admin.cc
+++ b/sql/sql_partition_admin.cc
@@ -211,6 +211,7 @@ bool compare_table_with_partition(THD *thd, TABLE *table, TABLE *part_table,
    part_table->use_all_columns();
    table->use_all_columns();
    if (unlikely(mysql_prepare_alter_table(thd, part_table, &part_create_info,
+                                         Lex_maybe_default_charset_collation(),

Same. Can be in part_create_info

Same here:

 mysql_prepare_alter_table(THD *thd, TABLE *table,
-                          HA_CREATE_INFO *create_info,
+                          Table_specification_st *create_info,



                                           &part_alter_info, &part_alter_ctx)))
    {
      my_error(ER_TABLES_DIFFERENT_METADATA, MYF(0));
diff --git a/sql/sql_i_s.h b/sql/sql_i_s.h
index bed2e886718..5ff06d32231 100644
--- a/sql/sql_i_s.h
+++ b/sql/sql_i_s.h
@@ -162,6 +162,11 @@ class Yesno: public Varchar
  {
  public:
    Yesno(): Varchar(3) { }
+  static LEX_CSTRING value(bool val)
+  {
+    return val ? Lex_cstring(STRING_WITH_LEN("Yes")) :
+                 Lex_cstring();
+  }

eh... please, rename the class from Yesno to something like
Yesempty or Yes_or_empty, something that says that the second
should not be Lex_cstring(STRING_WITH_LEN("No"))

Renamed and pushed as a separate commit:

commit 821808c45dd3c5d4bc98cd04810732f647872747 (origin/bb-10.5-bar)
Author: Alexander Barkov <bar@xxxxxxxxxxx>
Date:   Thu Apr 28 11:23:12 2022 +0400

    A clean-up for "MDEV-19772 Add helper classes for ST_FIELD_INFO"

    As agreed with Serg, renaming class Yesno to Yes_or_empty,
    to reflect better its behavior.



  };
diff --git a/sql/table.cc b/sql/table.cc
index a683a78ff49..c28cb2bd928 100644
--- a/sql/table.cc
+++ b/sql/table.cc
@@ -3491,6 +3493,16 @@ int TABLE_SHARE::init_from_sql_statement_string(THD *thd, bool write,
    else
      thd->set_n_backup_active_arena(arena, &backup);
+ /*
+    THD::reset_db() does not set THD::db_charset,
+    so it keeps pointing to the character set and collation
+    of the current database, rather than the database of the
+    new initialized table.

Hmm, is that correct? Could you check other invocation of
thd->reset_db()? Perhaps they all need to switch charset?
In that case it should be done inside THD::reset_db().

Or may be they have to use mysql_change_db_impl() instead?

Note, this part was moved to MDEV-27896.
It's not a part of the UCA14 patches any more.


Anyway, I checked the invocations of thd->reset_db() and did not find
a general rule quickly. At a glance, they mostly don't seem to
need to switch the charset. But this needs to be investigated further.
Should I create an MDEV for this?




+    Let's call get_default_db_collation() before reset_db().
+    This forces the db.opt file to be loaded.
+  */
+  db_cs= get_default_db_collation(thd, db.str);
+
    thd->reset_db(&db);
    lex_start(thd);
@@ -3498,6 +3510,11 @@ int TABLE_SHARE::init_from_sql_statement_string(THD *thd, bool write,
                  sql_unusable_for_discovery(thd, hton, sql_copy))))
      goto ret;
+ if (!(thd->lex->create_info.default_table_charset=
+         thd->lex->create_info.default_charset_collation.
+           resolved_to_character_set(db_cs, db_cs)))
+    DBUG_RETURN(true);

How could this (and similar if()'s in other files) fail?

It can fail in this scenario:

CREATE TABLE t1 (a CHAR(10) COLLATE uca1400_cs_ci) CHARACTER SET latin1;

UCA collations are not applicable to latin1 yet.


Btw, this part now looks different.
See HA_CREATE_INFO::resolve_to_charset_collation_context()
in sql_table.cc:


    if (!(default_table_charset=
            default_cscl.resolved_to_context(ctx)))
      return true;


+
    thd->lex->create_info.db_type= hton;
  #ifdef WITH_PARTITION_STORAGE_ENGINE
    thd->work_part_info= 0;                       // For partitioning
diff --git a/sql/mysys_charset.h b/sql/mysys_charset.h
new file mode 100644
index 00000000000..86eaeedd432
--- /dev/null
+++ b/sql/mysys_charset.h
@@ -0,0 +1,44 @@
+#ifndef MYSYS_CHARSET
+#define MYSYS_CHARSET
+
+/* Copyright (c) 2021, MariaDB Corporation.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; version 2 of the License.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, write to the Free Software
+   Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335  USA */
+
+
+#include "my_sys.h"
+
+
+class Charset_loader_mysys: public MY_CHARSET_LOADER
+{
+public:
+  Charset_loader_mysys()
+  {
+    my_charset_loader_init_mysys(this);
+  }
+  void raise_unknown_collation_error(const char *name,
+                                     CHARSET_INFO *name_cs) const;
+  CHARSET_INFO *get_charset(const char *cs_name, uint cs_flags, myf my_flags);
+  CHARSET_INFO *get_exact_collation(const char *name, myf utf8_flag);
+  CHARSET_INFO *get_contextually_typed_collation(CHARSET_INFO *cs,
+                                                 const char *name);
+  CHARSET_INFO *get_contextually_typed_collation(const char *name);
+  CHARSET_INFO *get_contextually_typed_collation_or_error(CHARSET_INFO *cs,
+                                                          const char *name);
+  CHARSET_INFO *find_default_collation(CHARSET_INFO *cs);
+  CHARSET_INFO *find_bin_collation_or_error(CHARSET_INFO *cs);
+};

you can have C++ code in mysys too, you know, no need to put it
in sql/mysys*

This is a good idea.

There was one problem: Charset_loader_mysys pushed errors and warnings
into the server diagnostics area. So it could not sit in
include/my_sys.h as is.

I split it into two parts:

- Charset_loader_mysys is defined entirely in include/my_sys.h
  and does not send any errors/warnings.
  It is self-sufficient and has no method implementations
  in C++ source files.

- There is a new class Charset_loader_server.
  It is defined in lex_charset.h as follows:

class Charset_loader_server: public Charset_loader_mysys

  It sends errors and warnings, and has parts implemented
  in lex_charset.cc.
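
Schematically, the split looks like this (a simplified standalone sketch
with stub types and placeholder bodies; only the class names, the
inheritance, and the method names quoted above come from the actual patch):

/* Stand-ins for the real mysys types, only to keep this sketch self-contained */
struct CHARSET_INFO;
struct MY_CHARSET_LOADER { char error[128]; };
static void my_charset_loader_init_mysys(MY_CHARSET_LOADER *loader)
{ loader->error[0]= '\0'; }

/* include/my_sys.h part: header-only, never touches the server
   diagnostics area; lookup methods simply return NULL on failure. */
class Charset_loader_mysys: public MY_CHARSET_LOADER
{
public:
  Charset_loader_mysys() { my_charset_loader_init_mysys(this); }
  CHARSET_INFO *get_exact_collation(const char *name)
  { (void) name; return 0; /* the real lookup is done by mysys */ }
};

/* sql/lex_charset.h part (implemented in lex_charset.cc):
   adds error/warning reporting on top of the mysys lookups. */
class Charset_loader_server: public Charset_loader_mysys
{
public:
  void raise_unknown_collation_error(const char *name) const
  { (void) name; /* my_error(ER_UNKNOWN_COLLATION, ...) in the real code */ }
};

int main()
{
  Charset_loader_server loader;
  if (!loader.get_exact_collation("utf8mb4_uca1400_ai_ci"))
    loader.raise_unknown_collation_error("utf8mb4_uca1400_ai_ci");
  return 0;
}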



+
+#endif // MYSYS_CHARSET
+
diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c
index b579f0af203..d09dfba86ed 100644
--- a/strings/ctype-simple.c
+++ b/strings/ctype-simple.c
@@ -1940,13 +1941,26 @@ my_bool my_propagate_complex(CHARSET_INFO *cs __attribute__((unused)),
  }
+void my_ci_set_strength(struct charset_info_st *cs, uint strength)
+{
+  DBUG_ASSERT(strength > 0 && strength <= MY_STRXFRM_NLEVELS);

don't use && in asserts, please create two separate asserts instead:

  DBUG_ASSERT(strength > 0);
  DBUG_ASSERT(strength <= MY_STRXFRM_NLEVELS);

Done.



+  cs->levels_for_order= ((1 << strength) - 1);

why do you still use the old concept of "strength"? Why not to use
bitmap consistently everywhere?

The collation definition file Index.xml is based on the LDML syntax.

It uses tags like this:

<settings strength="2"/>

This function is needed to handle these LDML tags.
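
For example, <settings strength="2"/> ends up enabling the primary and
secondary levels (a tiny standalone illustration of the same arithmetic,
not server code):

#include <assert.h>
#include <stdio.h>

/* Same arithmetic as my_ci_set_strength() above:
   LDML strength N enables collation levels 1..N as a bitmap. */
static unsigned levels_for_strength(unsigned strength)
{
  assert(strength > 0);
  assert(strength <= 4);
  return (1u << strength) - 1;
}

int main(void)
{
  printf("strength=1 -> 0x%02X\n", levels_for_strength(1)); /* 0x01 = MY_CS_COLL_LEVELS_S1 */
  printf("strength=2 -> 0x%02X\n", levels_for_strength(2)); /* 0x03 = MY_CS_COLL_LEVELS_S2 */
  printf("strength=4 -> 0x%02X\n", levels_for_strength(4)); /* 0x0F = MY_CS_COLL_LEVELS_S4 */
  return 0;
}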

Btw, to define user-defined _AI_CS collations we'll need
to add an LDML extension eventually.



+}
+
+
+void my_ci_set_level_flags(struct charset_info_st *cs, uint flags)
+{
+  DBUG_ASSERT(flags < (1<<MY_STRXFRM_NLEVELS));
+  cs->levels_for_order= flags;
+}
+
  /*
    Normalize strxfrm flags
SYNOPSIS:
      my_strxfrm_flag_normalize()
+    cs       - the CHARSET_INFO pointer
      flags    - non-normalized flags
-    nlevels  - number of levels
NOTES:
      If levels are omitted, then 1-maximum is assumed.
diff --git a/sql/handler.h b/sql/handler.h
index 8ad521e189a..1e82f37b1e7 100644
--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2409,7 +2386,32 @@ struct Table_specification_st: public HA_CREATE_INFO,
    {
      HA_CREATE_INFO::options= 0;
      DDL_options_st::init();
+    default_charset_collation.init();
+  }
+
+  bool
+  add_alter_list_item_convert_to_charset(const Lex_charset_collation_st &cl)
+  {
+    /*
+      cs cannot be NULL, as sql_yacc.yy translates
+         CONVERT TO CHARACTER SET DEFAULT
+      to
+         CONVERT TO CHARACTER SET <character-set-of-the-current-database>
+      TODO: Shouldn't we postpone resolution of DEFAULT until the
+      character set of the table owner database is loaded from its db.opt?
+    */
+    DBUG_ASSERT(cl.charset_collation());
+    DBUG_ASSERT(!cl.is_contextually_typed_collation());
+    alter_table_convert_to_charset= cl.charset_collation();
+    default_charset_collation.Lex_charset_collation_st::operator=(cl);

looks quite ugly. can you do, like, default_charset_collation.set(cl) ?


This code migrated to MDEV-27896 and is not a part of the UCA14 patches any more.

Now it looks different; there is no ::operator=(cl) any more.
There are more constructors instead.


Anyway, I'd like to comment:

I agree that it does not look like something we often
use in the MariaDB sources.

But I like it better than set(), because set() would require the
reader to jump around the sources to find out what set() actually does.

On the contrary, the line with the operator is very self-descriptive.
It's full of information:
"default_charset_collation derives from Lex_charset_collation_st,
and here we initialize the Lex_charset_collation_st part of it".

So I think direct use of operator=() makes reading easier.
Adding various set() wrappers around operator=() makes reading harder.


+    used_fields|= (HA_CREATE_USED_CHARSET | HA_CREATE_USED_DEFAULT_CHARSET);
+    return false;
    }
+  bool add_table_option_default_charset(CHARSET_INFO *cs);
+  bool add_table_option_default_collation(const Lex_charset_collation_st &cl);
+  bool resolve_db_charset_and_collation(THD *thd,
+                                        const LEX_CSTRING &db,
+                                        bool is_alter);
  };
diff --git a/strings/ctype-uca1400data.h b/strings/ctype-uca1400data.h
new file mode 100644
index 00000000000..da95dcfde54
--- /dev/null
+++ b/strings/ctype-uca1400data.h
@@ -0,0 +1,44151 @@
+/*
+  Generated from allkeys.txt version '14.0.0'
+*/

if it's generated, do you need to check it in?
perhaps it should be generated during the build?
you've checked in allkeys1400.txt anyway.

Right, we can consider it.

Btw, I've checked in the allkeys files for all versions:

$ ls mysql-test/std_data/unicode/

allkeys1400.txt
allkeys400.txt
allkeys520.txt

So we can generate sources for all three UCA versions
from these files.

But I suggest we do it separately.
Should I create an MDEV for this?



+static const uint16 uca1400_p000[]= { /* 0000 (4 weights per char) */
+0x0000,0x0000,0x0000,0x0000, 0x0000,0x0000,0x0000,0x0000, /* 0000 */
+0x0000,0x0000,0x0000,0x0000, 0x0000,0x0000,0x0000,0x0000, /* 0002 */
diff --git a/sql/sql_lex.cc b/sql/sql_lex.cc
index 6ca10267187..d115401a855 100644
--- a/sql/sql_lex.cc
+++ b/sql/sql_lex.cc
@@ -542,6 +542,30 @@ bool LEX::add_alter_list(LEX_CSTRING name, LEX_CSTRING new_name, bool exists)
  }
+bool LEX::add_alter_list_item_convert_to_charset(
+                                             THD *thd,
+                                             CHARSET_INFO *cs,
+                                             const Lex_charset_collation_st &cl)
+{
+  if (!cs)
+  {
+    Lex_charset_collation_st tmp;
+    tmp.set_charset_collate_default(thd->variables.collation_database);

Hmm, what if one is doing ALTER TABLE db.test CHARSET DEFAULT
and current db is not `db` but `test` ?

Right, thanks for noticing this. The problem that
both DEFAULT CHARACTER SET and CONVERT TO did not work
well in some cases has existed for a long time.

When I moved MDEV-27896 out of the UCA patches, I reported CONVERT TO problems in:

MDEV-28644 Unexpected error on ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb3, DEFAULT CHARACTER SET utf8mb4

The final patch for MDEV-27896 fixed this problem as well,
as it was very easy after fixing DEFAULT CHARACTER SET cs [COLLATE cl].


The idea is that both "DEFAULT CHARACTER SET" and "CONVERT TO" clauses
are now fully independent, and both use the new class
Lex_table_charset_collation_attrs_st as a storage:


struct Table_specification_st: public HA_CREATE_INFO,
                               public DDL_options_st
{
  Lex_table_charset_collation_attrs_st default_charset_collation;
  Lex_table_charset_collation_attrs_st convert_charset_collation;




+    if (!(cs= tmp.charset_collation()))
+      return true; // Should not actually happen

assert?

This code migrated to MDEV-27896 and was changed.
There is no line like this any more.

Instead, there are classes Lex_exact_charset,
Lex_exact_collation, Lex_context_collation.
They catch NULL in constructors, e.g.:

class Lex_exact_charset
{
  CHARSET_INFO *m_ci;
public:
  explicit Lex_exact_charset(CHARSET_INFO *ci)
   :m_ci(ci)




+  }
+
+  Lex_explicit_charset_opt_collate tmp(cs, false);
+  if (tmp.merge_opt_collate_or_error(cl) ||
+      create_info.add_alter_list_item_convert_to_charset(
+                    Lex_charset_collation(tmp)))
+    return true;
+
+  alter_info.flags|= ALTER_CONVERT_TO;
+  return false;
+}
+
+
  void LEX::init_last_field(Column_definition *field,
                            const LEX_CSTRING *field_name)
  {
@@ -11871,29 +11869,41 @@ CHARSET_INFO *Lex_collation_st::find_default_collation(CHARSET_INFO *cs)
    "def" is the upper level CHARACTER SET clause (e.g. of a table)
  */
  CHARSET_INFO *
-Lex_collation_st::resolved_to_character_set(CHARSET_INFO *def) const
+Lex_charset_collation_st::resolved_to_character_set(CHARSET_INFO *def) const
  {
    DBUG_ASSERT(def);
-  if (m_type != TYPE_CONTEXTUALLY_TYPED)
-  {
-    if (!m_collation)
-      return def;       // Empty - not typed at all
-    return m_collation; // Explicitly typed
+
+  switch (m_type) {
+  case TYPE_EMPTY:
+    return def;
+  case TYPE_CHARACTER_SET:
+    DBUG_ASSERT(m_ci);
+    return m_ci;
+  case TYPE_COLLATE_EXACT:
+    DBUG_ASSERT(m_ci);
+    return m_ci;
+  case TYPE_COLLATE_CONTEXTUALLY_TYPED:
+    break;
    }
// Contextually typed
-  DBUG_ASSERT(m_collation);
+  DBUG_ASSERT(m_ci);
- if (m_collation == &my_charset_bin) // CHAR(10) BINARY
-    return find_bin_collation(def);
+  Charset_loader_mysys loader;
+  if (is_contextually_typed_binary_style())    // CHAR(10) BINARY
+    return loader.find_bin_collation_or_error(def);
- if (m_collation == &my_charset_latin1) // CHAR(10) COLLATE DEFAULT
-    return find_default_collation(def);
+  if (is_contextually_typed_collate_default()) // CHAR(10) COLLATE DEFAULT
+    return loader.find_default_collation(def);
+
+  const LEX_CSTRING context_name= collation_name_context_suffix();

I'd rather put this in assert, not in if(). Like

I fixed this in MDEV-27896. The patch for MDEV-27896
has this assert in a couple of places:

  DBUG_ASSERT(!strncmp(cl.charset_info()->coll_name.str,
               STRING_WITH_LEN("utf8mb4_uca1400_")))

<cut>

diff --git a/strings/ctype-uca.c b/strings/ctype-uca.c
index b89916f3b20..3e6b4e4ce43 100644
--- a/strings/ctype-uca.c
+++ b/strings/ctype-uca.c
@@ -30542,7 +30613,7 @@ static const char vietnamese[]=
    Myanmar, according to CLDR Revision 8900.
    http://unicode.org/cldr/trac/browser/trunk/common/collation/my.xml
  */
-static const char myanmar[]= "[shift-after-method expand][version 5.2.0]"
+static const char myanmar[]= "[shift-after-method expand]"

What's going on with myanmar? You removed a version here and
added &my_uca_v520 below in its charset_info_st.
What does this change mean?

  /* Tones */
  "&\\u108C"
  "<\\u1037"
@@ -37627,7 +37825,7 @@ struct charset_info_st my_charset_utf32_myanmar_uca_ci=
      NULL,               /* to_lower     */
      NULL,               /* to_upper     */
      NULL,               /* sort_order   */
-    NULL,               /* uca          */
+    &my_uca_v520,       /* uca          */

What does this change?


There are two ways to define the version:

1. Using the [version...] option in the tailoring.


2. Using the hardcoded initialization in the charset_info_st definition.


Although built-in collations should normally use #2,
approach #1 also worked without problems for built-in collations.
But it assumed that the tailoring is used with only one UCA version!


So I changed the old built-in myanmar collation to use #2 instead of #1.
This changes nothing for the old myanmar collations,
but the tailoring defined in "static const char myanmar[]" can
now be reused in combination with multiple UCA versions.



      NULL,               /* tab_to_uni   */
      NULL,               /* tab_from_uni */
      &my_unicase_unicode520,/* caseinfo   */
Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx


