← Back to team overview

maria-developers team mailing list archive

Re: On constructing ref access from mis-matched charset comparisons

 

Hi Sergey,


On 02/24/2017 04:23 AM, Sergey Petrunia wrote:
> Hi Alexander,
> 
> I've got a question about mis-matched charset comparisons and ref access method.
> 
> == Short form ==
> I know that VARCHAR comparisons over mis-matching charsets cannot be used for 
> constructing index lookups into the index over the "narrower" character set
> column. 
> But is this a real limitation or just an optimizer deficiency?
> 
> If it is the latter, do people hit it in the real world sufficiently often to
> warrant lifting it?
> 
> == Long form ==
> 
> === Example dataset ===
> 
> create table t0 (
>   a varchar(32) character set utf8 collate utf8_general_ci
> );
> 
> create table t1 (
>   a varchar(32) character set latin1 collate latin1_swedish_ci,
>   col1 varchar(32),
>   key(a)
> );
> 
> create table t2 (
>   a varchar(32) character set utf8 collate utf8_general_ci,
>   col1 varchar(32),
>   key(a)
> );
> 
> insert into t0 select a from ten;
> insert into t1 select a,a from one_k;
> insert into t2 select a,a from one_k;
> 
> === Queries ===
> 
> MariaDB [test]> explain 
>     -> select * from t0, t2 where t0.a=t2.a;
> +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
> | id   | select_type | table | type | possible_keys | key  | key_len | ref       | rows | Extra       |
> +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
> |    1 | SIMPLE      | t0    | ALL  | NULL          | NULL | NULL    | NULL      |   10 | Using where |
> |    1 | SIMPLE      | t2    | ref  | a             | a    | 99      | test.t0.a |    1 |             |
> +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
> 
> Ok, good.
> 
> MariaDB [test]> explain 
>     -> select * from t0, t1 where t0.a=t1.a;
> +------+-------------+-------+------+---------------+------+---------+------+------+-------------------------------------------------+
> | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                           |
> +------+-------------+-------+------+---------------+------+---------+------+------+-------------------------------------------------+
> |    1 | SIMPLE      | t0    | ALL  | NULL          | NULL | NULL    | NULL |   10 |                                                 |
> |    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 1000 | Using where; Using join buffer (flat, BNL join) |
> +------+-------------+-------+------+---------------+------+---------+------+------+-------------------------------------------------+
> 2 rows in set (0.00 sec)
> 
> Oops, the index t1(a) is not used. FORMAT=JSON shows the reason:
> 
>       "attached_condition": "(t0.a = convert(t1.a using utf8))"
> 
> Still it feels like the optimizer could
> 
> * Try to convert the value of t0.a into latin1.  
> * If it can't be represented in latin1, then we know that no row
>   in t1 has t1.a=t0.a.

This is correct, if it cannot be represented in latin1, then there are
no rows that match the equality.

> * make a lookup on t1.a= convert(t0.a using latin1) , using t1.a's collation
>   rules.

This is not generally correct, because you propose to compare
using the collation of t1.a instead of the collation of t0.a,
and the two collations can have different rules.



> 
> Is this true [for some charsets] ?

It can work in some cases:

If the collation of t0.a is binary (e.g. utf8_bin),
then it can be done. But index lookup on t1.a will return
some false positives, so you'll need to filter them
out after index lookup. So it will involve two
character set conversions:

- Lookup on t1.a= convert(t0.a using latin1)
  It will return some values.
  Let's call them T0A0, T0A1, T0A2.
  The number of distinct values will depend on the t1.a's collation
  style (e.g. case and accent sensitivity).


- Test the found values using the original condition:
   t0.a=convert(T0A0 using utf8)
   t0.a=convert(T0A1 using utf8)
   t0.a=convert(T0A2 using utf8)
  and choose the one that matches
  (In case if t0.a has a _bin collation,
   not more than one value can match).


Note, the same trick can actually work for some non-binary collations
as well, but I need some time to formalize the rules when it's possible.

> 
> If yes, is it worth doing?

Index lookup is better than a full table scan.
So if it does not take too much implementation efforts,
then I'd do this.


> 
> BR
>  Sergei
> 


References