← Back to team overview

maria-developers team mailing list archive

Review #2 for: MDEV-26278 Elimination: consider GROUP BY as unique key

 

Hi Oleg,

Please find review input below. The most important is wrong-query-result
observation at the bottom.

A general comment: The term "Pseudo unique keys" is better than "synthetic
keys", but I'm concerned about how it will be read. Does "pseudo" attach to 
the "unique" (pseudo-unique?) or to the "key"? 
This sounds ambiguous to me but let's check with native speakers.

Does "Unique pseudo-key" sound better? At least it's unambigous...

> commit 75278ac03681b2cd8e6f5098b536edfa2f34a549
> Author: Oleg Smirnov <olernov@xxxxxxxxx>
> Date:   Thu Feb 17 22:53:37 2022 +0700
> 
>     MDEV-26278 Add functionality to eliminate derived tables from the query
>     
>     Elimination of unnecessary tables from SQL queries is already present
>     in MariaDB. But it only works for regular tables and not for derived ones.
>     
>     Imagine we have a view:
>       CREATE VIEW v2c AS
>       SELECT t11.a AS a, max(t12.col1) AS b
>       FROM t11 LEFT JOIN t12 ON t12.pk=t11.b
>       GROUP BY t11.a

A question: is it important that the view has an outer join in it? Or any
join? (I think not.) If not, please remove it. It's better when the example
does not have any unneeded details.

>     Due to "GROUP BY t11.a" the values of "a" are unique,
>     and this fact can be treated as like table "v2c" has a unique key
>     on field "a".
>     
>     Suppose we have a SQL query:
>       SELECT t1.* FROM t1 LEFT JOIN v2c ON t1.a=v2c.a
>     
>     Since "v2c.a" is unique, then:
>     1. "v2c" is functionally dependent on t1.
>        This means every record of "t1" will be either joined with
>        a single record from "v2c" or NULL-complemented.
>     2. No fields of "v2c" are present on the SELECT list
>     
>     These two facts allow to completely exclude (eliminate)
>     the derived table "v2c" from the query.
> 
...

> diff --git a/sql/sql_lex.h b/sql/sql_lex.h
> index 79d48528574..fb1422b7482 100644
> --- a/sql/sql_lex.h
> +++ b/sql/sql_lex.h
> @@ -1026,7 +1026,7 @@ class st_select_lex_unit: public st_select_lex_node {
>    int save_union_explain_part2(Explain_query *output);
>    unit_common_op common_op();
>  
> -  bool explainable() const;
> +  bool explainable();

Please keep the function "const". 
Yes, this will require making is_derived_eliminated() and outer_select() const,
as well, which seems easy to do.

> diff --git a/sql/opt_table_elimination.cc b/sql/opt_table_elimination.cc
> index 8c4720bdec4..7f52d4cba0b 100644
> --- a/sql/opt_table_elimination.cc
> +++ b/sql/opt_table_elimination.cc
> @@ -134,6 +136,11 @@
>     - Nodes representing unique keys. Unique key has
>        = incoming edges from key component value modules
>        = outgoing edge to key's table module
> +   - Nodes representing pseudo unique keys for derived tables.
> +     Pseudo unique keys are composed as a result of GROUP BY expressions.
> +     Like normal unique keys, they have:
> +      = incoming edges from key component value modules
> +      = outgoing edge to key's table module
>     - Inner side of outer join module. Outer join module has
>        = incoming edges from table value modules
>        = No outgoing edges. Once we reach it, we know we can eliminate the 
...
> @@ -447,6 +465,60 @@ const size_t Dep_module::iterator_size=
>    MY_MAX(Dep_module_expr::iterator_size, Dep_module_key::iterator_size);
>  
>  
> +/*
> +  A pseudo unique key module for a derived table.
> +  For example, a derived table
> +  SELECT t11.a, count(*) from t1 LEFT JOIN t2 ON t1.id = t2.fk GROUP BY t11.a
> +  has unique values in its first field (t11.a) due to GROUP BY expression
> +  so this can be considered as a unique key for this derived table
> +*/
> +
> +class Dep_module_pseudo_key : public Dep_module
> +{
> +public:
> +  Dep_module_pseudo_key(Dep_value_table *table_arg,
> +                        std::vector<field_index_t>&& field_indexes)
> +      : table(table_arg), derived_table_field_indexes(field_indexes)
> +  {
> +    unbound_args= static_cast<uint>(field_indexes.size());
> +  }
> +
> +  Dep_value_table *table;
> +
> +  Iterator init_unbound_values_iter(char *buf) override;
> +
> +  Dep_value *get_next_unbound_value(Dep_analysis_context *dac,
> +                                    Iterator iter) override;
> +
> +  bool covers_field(int field_index);
> +
> +  static const size_t iterator_size;
> +
> +private:
> +  /*
> +    Vector of field numbers (indexes) in the derived table's SELECT list
> +    which are included in the GROUP BY expression.
> +    For example, pseudo unique key for SQL
> +    "SELECT count(t12.pk) as cnt, t11.a as a FROM t11 LEFT JOIN t12
> +    ON t12.pk=t11.b GROUP BY t11.a, t12.b"
Same comment as above: does this example need outer join, or table t12 at all? 
If not, please remove it.

> +    will include one element: {1}, since "t11.a" is in the
> +    GROUP BY list and is present in the SELECT list with index 1
> +    (numeration starts from 0). Field "t12.b" is not included since
> +    though it's present in the GROUP BY list but is missing in the SELECT list
> +  */
> +  std::vector<field_index_t> derived_table_field_indexes;
> +
> +  class Value_iter
> +  {
> +  public:
> +    Dep_value_table *table;
> +  };
> +};
> +
> +const size_t Dep_module_pseudo_key::iterator_size=
> +  ALIGN_SIZE(sizeof(Dep_module_pseudo_key::Value_iter));

This doesn't seem to be used anywhere. Add it to the Dep_module::iterator_size
please:

  const size_t Dep_module::iterator_size=
    MY_MAX(Dep_module_expr::iterator_size, Dep_module_key::iterator_size);

...
> @@ -1577,33 +1654,127 @@ void add_module_expr(Dep_analysis_context *ctx, Dep_module_expr **eq_mod,
> +/*
> +  Create pseudo unique key and add it as a dependency
> +  if the query is not a UNION (ALL) and has GROUP BY expression
> +*/
> +
> +void Dep_analysis_context::create_pseudo_unique_key_if_needed(
> +    TABLE_LIST *table_list, Dep_value_table *tbl_dep)
> +{
> +  auto select_unit= table_list->get_unit();
> +  SELECT_LEX *first_select= nullptr;
> +  if (select_unit)
> +  {
> +    first_select= select_unit->first_select();
> +
> +    /*
> +      Exclude UNION (ALL) queries from consideration by checking
> +      next_select() == nullptr
> +    */
> +    if (unlikely(select_unit->first_select()->next_select()))
> +      first_select= nullptr;
> +  }
> +
> +  /*
> +    GROUP BY expression is considered as a pseudo unique key
> +    for the derived table. Add this pseudo key as a dependency
> +  */
> +  if (first_select && first_select->group_list.elements > 0)
> +  {
> +    /*
> +      Make sure GROUP BY elements contain only fields
> +     and no functions or other expressions
> +    */
> +    bool valid= TRUE;
> +    std::vector<field_index_t> exposed_fields_indexes;
> +    for (auto cur_group= first_select->group_list.first;
> +         cur_group;
> +         cur_group= cur_group->next)
> +    {
> +      auto elem= *(cur_group->item);
> +      if (elem->type() != Item::FIELD_ITEM)
> +      {
> +        valid= FALSE;
> +        break;
> +      }
> +      auto field_idx= find_field_in_list(first_select->join->fields_list, elem);
> +      if (field_idx != -1)
> +      {
> +        /*
> +          The field is found in the SELECT list, so it is exposed
> +          outside the derived table
> +        */
> +        exposed_fields_indexes.push_back(field_idx);
> +      }
I don't see where this function checks that all GROUP BY elements are in the
select list.

If one uses only a subset, for example "SELECT a FROM t ... GROUP BY a,b" then
the value of "a" is not unique.
Example:

create table t2 (a int, b int, c int);
insert into t2 select A.seq, B.seq, 123 from seq_1_to_3 A, seq_1_to_3 B;

explain 
select t1.* 
from
  t1 left join (select a, count(*) as cnt from t2 group by a,b) T
      on T.a=t1.a;

select t1.* 
from
  t1 left join (select a, count(*) as cnt from t2 group by a,b) T
      on T.a=t1.a;
(6 rows)

set optimizer_switch='table_elimination=off';
select t1.* 
from
  t1 left join (select a, count(*) as cnt from t2 group by a,b) T
      on T.a=t1.a;
(2 rows)

I'm surprised that testcases didn't hit this...

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net




References