maria-developers team mailing list archive

Thread
Date

Re: 940d028521f: MDEV-30164 System variable for default collations

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Tue, 21 Mar 2023 14:51:34 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <ZAsqKtzd5HT8WA8b@pweza>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0

Hello Sergei,


Please review a new patch version:

https://github.com/MariaDB/server/commit/8d51c6d234b1730d4ff3b2c1fe7828eeca81998b


Comments go inline:

As we discussed on slack, my plan was not to change the hard-coded
default collations in the collation library.
So in the library it will always be utf8mb4_general_ci as a default
collation for utf8mb4. And the server default collations will be
handled by this new variable character_set_collations.
I did not mean this variable to be something fringe or temporary.


I see. Then collations set as default this way must be undistinguishable
in behavior from compiled-in defaults.
In particular, SHOW COLLATIONS, I_S.COLLATIONS, etc (if anything) should
show them as default.


Done.


And performance considerations become more important, I wasn't paying
much attention to that, as I thought it's a fringe feature.

Like, a linear search in Charset_collation_map_st could be replaced with
a binary. A HASH might be too much overhead, but a binary search in the
array could be worth it. As it's a search that every single user will be
using in the future, perhaps even in every single query. Or even many
times per query.


Done. It now uses binary search.


+  LEX_CSTRING character_set_collations_str;


better LEX_CUSTRING, don't you think? It's not even a _str.


Right, it's not _str. It's a binary data. Removed the suffix.

What are benefits of LEX_CUSTRING?


LEX_CSTRING is a string, char*.

LEX_CUSTRING is an array of bytes, uchar*. I think it's more appropriate
here.


I tried LEX_CUSTRING. It required a little bit more code than
the current version. I propose to keep it LEX_CSTRING.

diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -1194,6 +1194,14 @@ bool Query_log_event::write()
       int2store(start+2, auto_increment_offset);
       start+= 4;
     }
+
+  if (thd && thd->variables.character_set_collations.count())
+  {
+    *start++= Q_COLLATIONS_SESSION;
+    size_t len= thd->variables.character_set_collations.to_binary((char*)start);
+    start+= len;
+  }


Perhaps, detect if it's needed? A cheap way of doing it would be to
extend your Elem_st with a query_id. And every time you
find_elem_by_charset, you set this elem's query_id to thd->query_id.
And here you write only elements with the current query id. If any.


As agreed on Slack, I did it in the following way:

The map is now written to binlog only if the statement
performed a "get default collation for a character set" resolution.

Most DML statements do not do such resolution, so this
approach is very good to reduce the number of
map writes. And it's very cheap and simple: a new flag in THD::used.

Additionally, as discussed on slack, I added automatic PS re-prepare
if @@character_set_collations has changed between PREPARE and EXECUTE.


Thanks.

References

Re: 940d028521f: MDEV-30164 System variable for default collations
From: Sergei Golubchik, 2023-03-07
Re: 940d028521f: MDEV-30164 System variable for default collations
From: Alexander Barkov, 2023-03-10
Re: 940d028521f: MDEV-30164 System variable for default collations
From: Sergei Golubchik, 2023-03-10