← Back to team overview

maria-developers team mailing list archive

On constructing ref access from mis-matched charset comparisons

 

Hi Alexander,

I've got a question about mis-matched charset comparisons and ref access method.

== Short form ==
I know that VARCHAR comparisons over mis-matching charsets cannot be used for 
constructing index lookups into the index over the "narrower" character set
column. 
But is this a real limitation or just an optimizer deficiency?

If it is the latter, do people hit it in the real world sufficiently often to
warrant lifting it?

== Long form ==

=== Example dataset ===

create table t0 (
  a varchar(32) character set utf8 collate utf8_general_ci
);

create table t1 (
  a varchar(32) character set latin1 collate latin1_swedish_ci,
  col1 varchar(32),
  key(a)
);

create table t2 (
  a varchar(32) character set utf8 collate utf8_general_ci,
  col1 varchar(32),
  key(a)
);

insert into t0 select a from ten;
insert into t1 select a,a from one_k;
insert into t2 select a,a from one_k;

=== Queries ===

MariaDB [test]> explain 
    -> select * from t0, t2 where t0.a=t2.a;
+------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref       | rows | Extra       |
+------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
|    1 | SIMPLE      | t0    | ALL  | NULL          | NULL | NULL    | NULL      |   10 | Using where |
|    1 | SIMPLE      | t2    | ref  | a             | a    | 99      | test.t0.a |    1 |             |
+------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+

Ok, good.

MariaDB [test]> explain 
    -> select * from t0, t1 where t0.a=t1.a;
+------+-------------+-------+------+---------------+------+---------+------+------+-------------------------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                           |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------------------------------------------+
|    1 | SIMPLE      | t0    | ALL  | NULL          | NULL | NULL    | NULL |   10 |                                                 |
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 1000 | Using where; Using join buffer (flat, BNL join) |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------------------------------------------+
2 rows in set (0.00 sec)

Oops, the index t1(a) is not used. FORMAT=JSON shows the reason:

      "attached_condition": "(t0.a = convert(t1.a using utf8))"

Still it feels like the optimizer could

* Try to convert the value of t0.a into latin1.  
* If it can't be represented in latin1, then we know that no row
  in t1 has t1.a=t0.a.
* make a lookup on t1.a= convert(t0.a using latin1) , using t1.a's collation
  rules.

Is this true [for some charsets] ?

If yes, is it worth doing?

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog




Follow ups