maria-developers team mailing list archive

Thread
Date

Re: 940d028521f: MDEV-30164 System variable for default collations

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Fri, 10 Mar 2023 12:09:05 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <ZAcQCbmDIvJJK7yH@pweza>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0

Hello Sergei,

Thanks for the review. Please see comments inline:


On 3/7/23 2:20 PM, Sergei Golubchik wrote:

Hi, Alexander,

I'm sorry it took a while.

I'm still thinking about the whole thing, it's a rather big change for
a really fringe functionality. But I failed to come up with something
better so far.

Code-wise the patch is mostly fine. See few small comments below, and
one slightly larger comment re. replication.


As we discussed on slack, my plan was not to change the hard-coded
default collations in the collation library.
So in the library it will always be utf8mb4_general_ci as a default
collation for utf8mb4. And the server default collations will be
handled by this new variable character_set_collations.
I did not mean this variable to be something fringe or temporary.


On Mar 07, Alexander Barkov wrote:

revision-id: 940d028521f (mariadb-10.11.1-4-g940d028521f)
parent(s): 71dedd0ebcd
author: Alexander Barkov
committer: Alexander Barkov
timestamp: 2022-12-14 20:00:22 +0400
message:

MDEV-30164 System variable for default collations

This patch adds a way to override default collations
(or "character set collations") for desired character sets.

diff --git a/sql/lex_charset.h b/sql/lex_charset.h
--- a/sql/lex_charset.h
+++ b/sql/lex_charset.h
@@ -544,6 +699,20 @@ struct Lex_exact_charset_extended_collation_attrs_st
    {
      return m_ci;
    }
+  CHARSET_INFO *charset_info(const Charset_collation_map_st &map) const
+  {
+    switch (m_type)
+    {
+    case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET:
+      return map.get_collation_for_charset(m_ci);
+    case TYPE_EMPTY:


Lex_exact_charset_extended_collation_attrs_st::TYPE_EMPTY
(or all case labels without a struct name)


Correct, the qualifiers were redundant here. Removing.

+    case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET_COLLATE_EXACT:
+    case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_CONTEXTUALLY_TYPED:


COLLATE DEFAULT is TYPE_COLLATE_CONTEXTUALLY_TYPED.
shouldn't it use the map too?



This method charset_info() is used only in one place, in this method of
class Column_definition:

void set_charset_collation_attrs(const Charset_collation_map_st &map,const


Lex_column_charset_collation_attrs_st &lc)
  {
    charset= lc.charset_info2(map);
    if (lc.is_contextually_typed_collation())
      flags|= CONTEXT_COLLATION_FLAG;
    else
      flags&= ~CONTEXT_COLLATION_FLAG;
  }


When this method is called from the parser, the character set
of the column may not be know yet:

CREATE TABLE t1 (a COLLATE DEFAULT /*HERE*/ ) CHARACTER SET utf8mb4;

Notice, the character set comes later, from the table level.


So in the method charset_info() we need to return the
verbatim value of Lex_exact_charset_extended_collation_attrs_st::m_ci
for all types (except TYPE_CHARACTER_SET, see below).


The returned value is copied to Column_definition::charset,
and the CONTEXT_COLLATION_FLAG bit is set or unset
in Column_definition::flags according to "context-ness"
of Lex_exact_charset_extended_collation_attrs_st.

The resolution of COLLATE DEFAULT happens later,
after parsing the whole CREATE TABLE statement,
when we have the information about the table level character set.



But if Lex_exact_charset_extended_collation_attrs_st::m_type
is equal to TYPE_CHARACTER_SET, we can find the current default
collation for the character set right here,
inside Lex_exact_charset_extended_collation_attrs_st::charset_info().
There is no a need to postpone the resolution.


And by the way, there is no even a way to postpone this resolution,
because Column_definition does not distinguish between:

a. Column_definition::charset is CHARACTER SET
b. Column_definition::charset is CHARACTER SET with COLLATE

The former needs to take into account @@character_set_collations,
the latter does not need.

+    case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_EXACT:
+      break;
+    }
+    return m_ci;
+  }
    Type type() const
    {
      return m_type;
diff --git a/sql/log_event.h b/sql/log_event.h
--- a/sql/log_event.h
+++ b/sql/log_event.h
@@ -2147,6 +2153,8 @@ class Query_log_event: public Log_event
    bool sql_mode_inited;
    bool charset_inited;

+ LEX_CSTRING character_set_collations_str;


better LEX_CUSTRING, don't you think? It's not even a _str.


Right, it's not _str. It's a binary data. Removed the suffix.

What are benefits of LEX_CUSTRING?

I tried to change the data type,
it didn't make the code better.

+
    uint32 flags2;
    sql_mode_t sql_mode;
    ulong auto_increment_increment, auto_increment_offset;
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -1194,6 +1194,14 @@ bool Query_log_event::write()
      int2store(start+2, auto_increment_offset);
      start+= 4;
    }
+
+  if (thd && thd->variables.character_set_collations.count())
+  {
+    *start++= Q_COLLATIONS_SESSION;
+    size_t len= thd->variables.character_set_collations.to_binary((char*)start);
+    start+= len;
+  }


Perhaps, detect if it's needed? A cheap way of doing it would be to extend
your Elem_st with a query_id. And every time you find_elem_by_charset,
you set this elem's query_id to thd->query_id. And here you write only
elements with the current query id. If any.

Another approach would be to have a bitmap, like

   uchar used_default_coll_mapping;

and in find_elem_by_charset() you set the bit, like

   used_default_coll_mapping |= 1 << i;

and then, again, print affected collations, if any. Most often
used_default_coll_mapping will likely be zero

This will need additional resources even if the server is not inreplication at all.


But on the other hand, when the server is in replication,
the benefit on the slave side is not so obvious.

I propose a different optimization. It's related to the question below:


one more question. In, say, 10.10->11.1 replication
master and slave will have different default collations, but
thd->variables.character_set_collations will not reflect that.
How do you plan to solve it?

+
    if (charset_inited)
    {
      *start++= Q_CHARSET_CODE;

Every major release will have its own default value forcharacter_set_collations.


For example, in 11.1 we plan to change it to:

utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf16le=uca1400_ai_ci,utf32=uca1400_ai_ci

while in 10.11 it will be empty by default.


I propose to add server version into the binlog event header.
(by the way it will be useful for many other purposes).


If we know the master version on the slave side, we
can optimize writing of the Q_COLLATION_SESSION chunk
when character_set_collations is equal to its default
value (according to the master version):

- We can either omit writing Q_COLLATIONS_SESSION completely

- Or at least write a very short version (e.g. with count==0)

which will mean "use the default value of character_set_collations
corresponding to my version".


Oh, by the way I need to rename Q_COLLATIONS_SESSION to
Q_CHARACTER_SET_COLLATIONS.

Regards,

Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx

Follow ups

Re: 940d028521f: MDEV-30164 System variable for default collations
From: Sergei Golubchik, 2023-03-10

References

Re: 940d028521f: MDEV-30164 System variable for default collations
From: Sergei Golubchik, 2023-03-07