maria-developers team mailing list archive

Thread
Date

Re: 026e52bad84: MDEV-27743 Remove Lex::charset

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Mon, 21 Mar 2022 10:45:07 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <YjSmHww5hsvu6gde@pweza>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0

  Hello Sergei,

Thanks for the review. A new version is here:

https://github.com/MariaDB/server/tree/bb-10.9-bar-MDEV-27743

Please see commend below:


On 3/18/22 7:32 PM, Sergei Golubchik wrote:

Hi, Alexander,

Please, see below

On Mar 18, Alexander Barkov wrote:

revision-id: 026e52bad84 (mariadb-10.6.1-369-g026e52bad84)
parent(s): 369498d6b8c
author: Alexander Barkov
committer: Alexander Barkov
timestamp: 2022-03-18 12:26:03 +0400
message:

MDEV-27743 Remove Lex::charset

This patch also fixes:

MDEV-27690 Crash on `CHARACTER SET csname COLLATE DEFAULT` in column definition
MDEV-27853 Wrong data type on column `COLLATE DEFAULT` and table `COLLATE some_non_default_collation`
MDEV-28067 Multiple conflicting column COLLATE clauses are not rejected
MDEV-28118 Wrong collation of `CAST(.. AS CHAR COLLATE DEFAULT)`
MDEV-28119 Wrong column collation on MODIFY + CONVERT

diff --git a/mysql-test/main/ctype_collate_column.test b/mysql-test/main/ctype_collate_column.test
new file mode 100644
index 00000000000..6a7cf2cc84c
--- /dev/null
+++ b/mysql-test/main/ctype_collate_column.test
@@ -0,0 +1,637 @@

...

+
+DELIMITER $$;
+CREATE PROCEDURE p1(dt TEXT, cs TEXT, cl0 TEXT, cl1 TEXT, cl2 TEXT, tcs TEXT)
+BEGIN
+  DECLARE errstate TEXT DEFAULT NULL;
+  DECLARE errno INT DEFAULT NULL;
+  DECLARE errmsg TEXT DEFAULT NULL;
+  DECLARE query TEXT;
+  DECLARE result TEXT;
+  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
+  BEGIN
+    GET DIAGNOSTICS CONDITION 1 errstate = RETURNED_SQLSTATE,
+      errno = MYSQL_ERRNO, errmsg = MESSAGE_TEXT;
+  END;
+  SET query= CONCAT('CREATE TABLE t1 (a ', dt, ' ', cs, ' ', cl0,
+                           ' NOT NULL ',cl1,
+                           ' DEFAULT '''' ', cl2,
+                           ') ', tcs, ' ENGINE=Memory');
+  EXECUTE IMMEDIATE query;
+  IF errmsg IS NOT NULL
+  THEN
+    SET result=CONCAT('ERROR: ',
+                      COALESCE(errstate,'<NULL>'), ' ',
+                      COALESCE(errno,'<NULL>'), ' ',
+                      COALESCE(errmsg,'<NULL>'));
+    INSERT INTO results (dt,cs,cl0,cl1,cl2,tcs,query,result)
+                 VALUES (dt,cs,cl0,cl1,cl2,tcs,query,result);
+  ELSE
+    FOR row IN (SELECT CONCAT(COLUMN_TYPE,
+                  ' CHARACTER SET ', CHARACTER_SET_NAME,
+                  ' COLLATE ', COLLATION_NAME) AS result
+                FROM INFORMATION_SCHEMA.COLUMNS
+                WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1')
+    DO
+      INSERT INTO results (dt,cs,cl0,cl1,cl2,tcs,query,result)
+                   VALUES (dt,cs,cl0,cl1,cl2,tcs,query,row.result);
+    END FOR;
+    DROP TABLE t1;
+  END IF;
+END;
+$$
+DELIMITER ;$$
+
+
+DELIMITER $$;
+CREATE PROCEDURE p3(dt TEXT)
+BEGIN
+  FOR row IN (
+    SELECT cs, cl0, cl1.cl1 AS cl1, cl2.cl1 AS cl2, tcs
+    FROM cs, cl0, cl1, cl1 AS cl2, tcs
+    ORDER BY cs, cl0, cl1, cl2, tcs
+  )
+  DO
+    CALL p1('char(10)', row.cs, row.cl0, row.cl1, row.cl2, row.tcs);


minor detail: here ^^^ you want to use dt, not 'char(10)'



Thanks for noticing this. Fixed.

+  END FOR;
+END;
+$$
+DELIMITER ;$$
+
+
+--disable_column_names
+CALL p3('char(10)');
+--enable_column_names
+
+SELECT * FROM results_statistics;


isn't it redundant? you select all rows from result
anyway. If something will change, the result of that select will already
cause the the to fail. Why do you need second mismatch in the
result/reject.


Right, after adding the full "SELECT FROM results" outoput,
the table "results_statistics" become redundant.

I removed it.

+
+--vertical_results
+SELECT query, result, '' AS `` FROM results
+ORDER BY dt, cs, cl0, cl1, cl2, tcs;
+--horizontal_results
+
+DROP PROCEDURE p1;
+DROP PROCEDURE p3;
+
+DROP TABLE cs, cl0, cl1, tcs;

...

+# CREATE TABLE t1
+# (
+#   a CHAR(10) BINARY NOT NULL DEFAULT ''
+# ) CHARACTER SET cs COLLATE cs_bin;
+
+DELETE FROM results
+WHERE cs=''
+  AND cl0='BINARY'
+  AND cl1=''
+  AND cl2=''
+  AND tcs LIKE 'CHARACTER SET%'
+  AND result NOT LIKE 'ERROR%'
+  AND result RLIKE CONCAT('COLLATE ', tcs_character_set_name, '_bin');
+SELECT ROW_COUNT();

...

+DROP FUNCTION is_collate_clause_with_explicit_default_collation;
+DROP FUNCTION is_collate_clause_with_explicit_collation;
+DROP FUNCTION is_conflicting_charset_explicit_collate_explicit;
+DROP FUNCTION is_conflicting_collate_explicit2;
+DROP FUNCTION is_conflicting_collate_default_collate_explicit;
+DROP FUNCTION collate_cs_default_collation;


I think the main result is `select * from results` (with order by, etc).
It's great, that you added that.

Everything else is redundant, I suggest to remove other selects and
deletes.



`select * from results` returns 3k+ lines. It's too hard to check
manually that every line is correct. I am planning to add more
combinations into this tests. The number of lines will grow,
which will make manual checking ever harder.

I suggest too keep DELETEs.

The DELETE queries make sure that all lines are correct, namely:

- everything that is expected to return an error, returns an error,
  and returns the correct expected error.

- everything that is supposed to return OK, successfully creates the
  table, and uses the expected column collation.

+
+
+--echo #
+--echo # End of 10.9 tests
+--echo #
diff --git a/sql/lex_charset.h b/sql/lex_charset.h
new file mode 100644
index 00000000000..d95f0bc1920
--- /dev/null
+++ b/sql/lex_charset.h
@@ -0,0 +1,202 @@

...

+#define LEX_CHARSET_COLLATION_TYPE_BITS 2
+  static_assert(((1<<LEX_CHARSET_COLLATION_TYPE_BITS)-1) >=
+                 TYPE_COLLATE_CONTEXTUALLY_TYPED,
+                 "Lex_charset_collation_st::Type bits check");
+
+protected:
+  CHARSET_INFO *m_ci;
+  Type m_type;
+public:
+  static CHARSET_INFO *find_bin_collation(CHARSET_INFO *cs);
+  static CHARSET_INFO *find_default_collation(CHARSET_INFO *cs);
+public:

...

+  void set_contextually_typed_binary_style()
+  {
+    m_ci= &my_collation_contextually_typed_binary;
+    m_type= TYPE_COLLATE_CONTEXTUALLY_TYPED;
+  }
+  bool is_contextually_typed_collate_default() const
+  {
+    return m_ci == &my_collation_contextually_typed_default &&
+           m_type == TYPE_COLLATE_CONTEXTUALLY_TYPED;


what does it mean when only one of these two expressions is true?
Say, the first true, the second false?
Or the first false, the second true?


As you know, in the final patch for "UCA-14.0.0 collations",
UCA-14.0.0 context collations use a pointer to their utf8mb4
CHARSET_INFO versions.


For example, uca1400_as_ci uses a pointer to utf8mb4_uca1400_as_ci,
with m_type==TYPE_COLLATE_CONTEXTUALLY_TYPED indicating that
it the charset part of the collation is not really known yet and is a
subject to further resolution.


I agree that without the final patch for UCA-14.0.0, the
enum value TYPE_COLLATE_CONTEXTUALLY_TYPED is not really needed,
we could just use TYPE_COLLATE_EXACT in combination with
- my_collation_contextually_typed_binary and
- my_collation_contextually_typed_default

But I suggest to keep the code this way, withTYPE_COLLATE_CONTEXTUALLY_TYPED added, to avoid the code

changes to and fro, as we know we'll add
the final UCA-14.0.0 soon.

I can add some asserts and comments inside these
set_contextually_typed_*() and is_contextually_typed_*(),
which we'll remove after merging the final patch for UCA-14.0.0.

+  }
+  bool is_contextually_typed_binary_style() const
+  {
+    return m_ci == &my_collation_contextually_typed_binary &&
+           m_type == TYPE_COLLATE_CONTEXTUALLY_TYPED;
+  }

...

diff --git a/strings/ctype-bin.c b/strings/ctype-bin.c
index bb746ad90b0..a5c9129ebea 100644
--- a/strings/ctype-bin.c
+++ b/strings/ctype-bin.c
@@ -630,3 +630,69 @@ struct charset_info_st my_charset_bin =
      &my_charset_handler,
      &my_collation_binary_handler
  };
+
+
+struct charset_info_st my_collation_contextually_typed_binary=
+{
+    0,0,0,                        /* number        */


I thought you will never need to resove the
&my_collation_contextually_typed_binary pointer, that is, I thought the
values in the my_collation_contextually_typed_binary struct should not
matter as they'll never be used. Was that wrong?


Correct, I will never need to resolve.

I just thought that using an initialized memory over a non-initializedmemory is better.


So I'd like to avoid non-initialized variables like this:

struct charset_info_st my_collation_contextually_typed_binary;


I agree that full detailed initializing, similar to my_charset_bin,
is not really needed for these two contextual variables.
So something like this should be enough:

struct charset_info_st my_collation_contextually_typed_binary= {0};

Should I fix this way?


Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx

Follow ups

Re: 026e52bad84: MDEV-27743 Remove Lex::charset
From: Sergei Golubchik, 2022-03-21

References

Re: 026e52bad84: MDEV-27743 Remove Lex::charset
From: Sergei Golubchik, 2022-03-18