← Back to team overview

maria-developers team mailing list archive

Re: 940d028521f: MDEV-30164 System variable for default collations

 

Hi, Alexander,

On Mar 10, Alexander Barkov wrote:
> On 3/7/23 2:20 PM, Sergei Golubchik wrote:
> > I'm sorry it took a while.
> >
> > I'm still thinking about the whole thing, it's a rather big change
> > for a really fringe functionality. But I failed to come up with
> > something better so far.
> >
> > Code-wise the patch is mostly fine. See few small comments below,
> > and one slightly larger comment re. replication.
>
> As we discussed on slack, my plan was not to change the hard-coded
> default collations in the collation library.
> So in the library it will always be utf8mb4_general_ci as a default
> collation for utf8mb4. And the server default collations will be
> handled by this new variable character_set_collations.
> I did not mean this variable to be something fringe or temporary.

I see. Then collations set as default this way must be undistinguishable
in behavior from compiled-in defaults.
In particular, SHOW COLLATIONS, I_S.COLLATIONS, etc (if anything) should
show them as default.

And performance considerations become more important, I wasn't paying
much attention to that, as I thought it's a fringe feature.

Like, a linear search in Charset_collation_map_st could be replaced with
a binary. A HASH might be too much overhead, but a binary search in the
array could be worth it. As it's a search that every single user will be
using in the future, perhaps even in every single query. Or even many
times per query.

> > On Mar 07, Alexander Barkov wrote:
> >> revision-id: 940d028521f (mariadb-10.11.1-4-g940d028521f)
> >> parent(s): 71dedd0ebcd
> >> author: Alexander Barkov
> >> committer: Alexander Barkov
> >> timestamp: 2022-12-14 20:00:22 +0400
> >> message:
> >>
> >> MDEV-30164 System variable for default collations
> >>
> >> This patch adds a way to override default collations
> >> (or "character set collations") for desired character sets.
> >>
> >> diff --git a/sql/lex_charset.h b/sql/lex_charset.h
> >> --- a/sql/lex_charset.h
> >> +++ b/sql/lex_charset.h
>
> >> +    case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET_COLLATE_EXACT:
> >> +    case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_CONTEXTUALLY_TYPED:
> >
> > COLLATE DEFAULT is TYPE_COLLATE_CONTEXTUALLY_TYPED.
> > shouldn't it use the map too?
>
> This method charset_info() is used only in one place, in this method of
> class Column_definition:
...

Thanks for the explanation!

> >> +    case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_EXACT:
> >> +      break;
> >> +    }
> >> +    return m_ci;
> >> +  }
> >>     Type type() const
> >>     {
> >>       return m_type;
> >> diff --git a/sql/log_event.h b/sql/log_event.h
> >> --- a/sql/log_event.h
> >> +++ b/sql/log_event.h
> >> @@ -2147,6 +2153,8 @@ class Query_log_event: public Log_event
> >>     bool sql_mode_inited;
> >>     bool charset_inited;
> >>
> >> +  LEX_CSTRING character_set_collations_str;
> >
> > better LEX_CUSTRING, don't you think? It's not even a _str.
>
> Right, it's not _str. It's a binary data. Removed the suffix.
>
> What are benefits of LEX_CUSTRING?

LEX_CSTRING is a string, char*.

LEX_CUSTRING is an array of bytes, uchar*. I think it's more appropriate
here.

> >> diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
> >> --- a/sql/log_event_server.cc
> >> +++ b/sql/log_event_server.cc
> >> @@ -1194,6 +1194,14 @@ bool Query_log_event::write()
> >>       int2store(start+2, auto_increment_offset);
> >>       start+= 4;
> >>     }
> >> +
> >> +  if (thd && thd->variables.character_set_collations.count())
> >> +  {
> >> +    *start++= Q_COLLATIONS_SESSION;
> >> +    size_t len= thd->variables.character_set_collations.to_binary((char*)start);
> >> +    start+= len;
> >> +  }
> >
> > Perhaps, detect if it's needed? A cheap way of doing it would be to
> > extend your Elem_st with a query_id. And every time you
> > find_elem_by_charset, you set this elem's query_id to thd->query_id.
> > And here you write only elements with the current query id. If any.
> >
> > Another approach would be to have a bitmap, like
> >
> >    uchar used_default_coll_mapping;
> >
> > and in find_elem_by_charset() you set the bit, like
> >
> >    used_default_coll_mapping |= 1 << i;
> >
> > and then, again, print affected collations, if any. Most often
> > used_default_coll_mapping will likely be zero
>
> This will need additional resources even if the server is not in
> replication at all.

8 bytes per collation in the map or one byte per thd?
That's not noticeable.

> But on the other hand, when the server is in replication,
> the benefit on the slave side is not so obvious.

The benefit is that it'll log only the affected collations (typically
zero or one) not the whole map in every event.

> I propose a different optimization. It's related to the question
> below:
>
> > one more question. In, say, 10.10->11.1 replication
> > master and slave will have different default collations, but
> > thd->variables.character_set_collations will not reflect that.
> > How do you plan to solve it?
> >
> >> +
> >>     if (charset_inited)
> >>     {
> >>       *start++= Q_CHARSET_CODE;
>
> Every major release will have its own default value for
> character_set_collations.
>
> For example, in 11.1 we plan to change it to:
>
> utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf16le=uca1400_ai_ci,utf32=uca1400_ai_ci
>
> while in 10.11 it will be empty by default.
>
> I propose to add server version into the binlog event header.
> (by the way it will be useful for many other purposes).

It's there already. see log_event.cc for
Format_description_log_event::server_version_split and class Version.

> If we know the master version on the slave side, we
> can optimize writing of the Q_COLLATION_SESSION chunk
> when character_set_collations is equal to its default
> value (according to the master version):

I don't know how it can work, if you replicate from 11.1 to
an old slave that knows how to read Q_COLLATION_SESSION, but doesn't
know what hard-coded defaults are in some later MariaDB version.

Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx


Follow ups

References