← Back to team overview

maria-developers team mailing list archive

Re: Review input for: MDEV-22534 Decorrelate IN subquery

 

Hi Sergei,

Thanks again for your comments.

I have added a comment on the jira ticket including the link to a patch
that addresses your comments. I will take one more look at it tomorrow
morning before sending it to you for review, but I thought maybe it
makes sense to respond now (see below) given the timezone difference.

> Hi Yuchen,
> 
> Please find below review input for 
> 
> > commit ba9cf8efeb0b1e1c132d565adb79c98156783f15
> > Author: Yuchen Pei <yuchen.pei@xxxxxxxxxxx>
> > Date:   Fri May 19 20:06:31 2023 +1000
> >
> >     MDEV-22534 Decorrelate IN subquery
> >     
> 
> First, general questions:
> 
> Q1: It seems there is some overlap between
> Item_in_subselect::exists2in_processor() and 
> Item_exists_subselect::exists2in_processor() Did you consider factoring
> out common code rather than copying it? (if yes, I'd like to know why
> it couldn't be done).

Done. I factored out the common code into three methods:

- Item_exists_subselect::exists2in_prepare()
- Item_exists_subselect::exists2in_create_or_update_in()
- Item_exists_subselect::exists2in_and_is_not_nulls()

In doing so I tried to not alter the logic of the exists2in
transformation, while keeping the decorrelate-in transformation
simple. For things done in exists2in but not in my decorrelate-in
implementation, I add a substype() check.

> Q2: Where is the coe that removes the correlated equalities from the 
> subquery's WHERE? (I assume Item_exists_subselect code does it?)

Naively, the best place to remove the equalities would be in
find_inner_outer_equalities(), when it iterates over the conds and one
could simply call the remove() method on the iterator. But we can't just
do that because after the call to find_inner_outer_equalities() we need
to do more checks, as well as replacing the equalities with `IS NOT
NULL`s in case of upper_not to ensure the 3vl value of the subquery
remains the same (see my previous response about three value logic).

> > diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc
> > index 9e6c205ca76..8d893794da3 100644
> > --- a/sql/item_subselect.cc
> > +++ b/sql/item_subselect.cc
> > @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds,
> >    return TRUE;
> >  }
> >  
> > +/* Checks whether item tree intersects with the free list */
> > +static bool intersects_free_list(Item *item, THD *thd)
> > +{
> > +  for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next)
> > +    if (item->walk(&Item::find_item_processor, 1, (void *) to_find))
> > +      return true;
> > +  return false;
> > +}
> > +
> 
> Please try moving this function into a separate file. First name that came to
> my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea
> is that we should aim for better structure.
> (This request is only valid if the factoring out suggested above doesn't
> work out).

Replied previously. BTW why should this function be in a separate file?
Is it because it is not a method of any of the Item_subselect classes?
 
> > +/* De-correlate where conditions in an IN subquery */
> 
> This needs a bigger comment describing what the rewrite does.

Done and replied previously

> > +bool Item_in_subselect::exists2in_processor(void *opt_arg)
> > +{
> > +  THD *thd= (THD *)opt_arg;
> > +  SELECT_LEX *first_select= unit->first_select();
> > +  JOIN *join= first_select->join;
> > +  bool will_be_correlated;
> > +  Dynamic_array<EQ_FIELD_OUTER> eqs(PSI_INSTRUMENT_MEM, 5, 5);
> > +  List<Item> outer;
> > +  Query_arena *arena= NULL, backup;
> > +  int res= FALSE;
> > +  DBUG_ENTER("Item_in_subselect::exists2in_processor");
> > +
> 
> Please move comments out of the if statement and number the conditions.
> 
> For examples, see best_access_path(), large if-statement starting with
>   "Don't test table scan if it can't be better."
> or 
>   "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause"
> in Item_exists_subselect::exists2in_processor().

Done and replied previously

> > +  if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) ||
> > +      /* proceed only if I'm a toplevel IN or a toplevel NOT IN */
> > +      !(is_top_level_item() ||
> > +        (upper_not && upper_not->is_top_level_item())) ||
> 
> what is the reason behind this condition?
> If we're able to handle top-level NOT IN, why can't we handle subquery in any
> context?
> 
> Can we really handle the NOT IN? (I suspect not, it will suffer from the
> semantics of IN subqueries with NULLs.. 
> I can't find the name for this, it's described e.g. here:
> https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )

Done and replied previously

> 
> > +      first_select->is_part_of_union() ||  /* skip if part of a union */
> > +      first_select->group_list.elements || /* skip if with group by */
> > +      first_select->with_sum_func ||       /* skip if aggregation */
> > +      join->having ||                      /* skip if with having */
> > +      !join->conds ||                      /* skip if no conds */
> > +      /* skip if left expr is a single nonscalar subselect */
> > +      (left_expr->type() == Item::SUBSELECT_ITEM &&
> > +       !left_expr->type_handler()->is_scalar_type()))
> 
> The above seems to be some exotic SQL. Is it something like 
>   
>   (select col1,lol2 from t1) IN (select col3,col4 from t2 where ..)
> 
> and does MariaDB really support this? Please add a comment about
> this.

Done and replied previously

> 
> > +    DBUG_RETURN(FALSE);
> > +  /* iterate over conditions, and check whether they can be moved out. */
> > +  if (find_inner_outer_equalities(&join->conds, eqs))
> > +    DBUG_RETURN(FALSE);
> > +  /* If we are in the execution of a prepared statement, check for
> > +  intersection with the free list to avoid segfault. Note that the
> > +  check for prepared statement execution is necessary, otherwise it
> > +  will likely always find intersection thus abort the
> > +  transformation.
> > +
> > +  fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is
> > +  defined. It causes CI's like amd64-debian-10-debug-embedded to
> > +  fail. Are there other ways to find out we are in the execution of a
> > +  prepared statement? */
> 
> I'm looking at the code of activate_stmt_arena_if_needed(). It seems
> "stmt_arena->is_conventional()" is the check you're looking for.

Done and replied previously

> 
> > +  if (thd->m_statement_state.m_parent_prepared_stmt)
> > +  {
> > +    for (uint i= 0; i < (uint) eqs.elements(); i++)
> > +    {
> > +      if (intersects_free_list(*eqs.at(i).eq_ref, thd))
> > +        DBUG_RETURN(FALSE);
> 
> This check doesn't look good. 
> As far as I understand, it is a temporary measure until related
> re-execution bug(s) are fixed by Igor (and/or Sanja)?

Replied previously

> > +    }
> > +  }
> 
> > +  /* Determines whether the result will be correlated */
> > +  {
> > +    List<Item> unused;
> > +    Collect_deps_prm prm= {&unused,          // parameters
> > +      unit->first_select()->nest_level_base, // nest_level_base
> > +      0,                                     // count
> > +      unit->first_select()->nest_level,      // nest_level
> > +      FALSE                                  // collect
> > +    };
> > +    walk(&Item::collect_outer_ref_processor, TRUE, &prm);
> > +    DBUG_ASSERT(prm.count > 0);
> > +    DBUG_ASSERT(prm.count >= (uint)eqs.elements());
> > +    will_be_correlated= prm.count > (uint)eqs.elements();
> See the example pasted in the MDEV:
> https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259705
> 
> Somehow, I get here will_be_correlated==FALSE but Materialization is still not
> considered for the subquery?

Done and replied in the ticket

> 
> > +  }
> > +  /* Back up the free list */
> > +  arena= thd->activate_stmt_arena_if_needed(&backup);
> > +  /* Construct the items for left_expr */
> > +  if (left_expr->type() == Item::ROW_ITEM)
> > +    for (uint i= 0; i < left_expr->cols(); i++)
> > +      outer.push_back(left_expr->element_index(i));
> > +  else
> > +    outer.push_back(left_expr);
> > +  const uint offset= first_select->item_list.elements;
> > +  /* Move items to outer and select item list */
> > +  for (uint i= 0; i < (uint)eqs.elements(); i++)
> > +  {
> > +    Item **eq_ref= eqs.at(i).eq_ref;
> > +    Item_ident *local_field= eqs.at(i).local_field;
> > +    Item *outer_exp= eqs.at(i).outer_exp;
> > +    first_select->item_list.push_back(local_field, thd->mem_root);
> > +    first_select->ref_pointer_array[offset + i]= (Item *)local_field;
> 
> How do we know that ref_pointer_array has enough room for the 
> new elements? 
> (We can discuss this with Sanja, he's an expert on the topic).
> 
> I see this check in Item_exists_subselect::exists2in_processor
> 
>     if ((uint)eqs.elements() > (first_select->item_list.elements +
>                                 first_select->select_n_reserved))
> 
> Is it how Item_exists_subselect handles this issue?

Yes, I think so, done

Best,
Yuchen


References