maria-developers team mailing list archive

Thread
Date
Re: Review input for: MDEV-22534 Decorrelate IN subquery

To: Sergey Petrunia <sergey@xxxxxxxxxxx>
From: Yuchen Pei <yuchen.pei@xxxxxxxxxxx>
Date: Fri, 26 May 2023 18:02:00 +1000
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <ZG+zhd75pY9ypt+y@psp15s> (Sergey Petrunia's message of "Thu, 25 May 2023 22:14:13 +0300")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Hi Sergey,

Thanks for your comments. Please find some quick responses to some of
the comments while they are still fresh in my mind before the weekend
comes, and more to follow next week.
On Thu 2023-05-25 22:14:13 +0300, Sergey Petrunia wrote:

> Hi Yuchen,
>
> Please find below review input for 
>
>> commit ba9cf8efeb0b1e1c132d565adb79c98156783f15
>> Author: Yuchen Pei <yuchen.pei@xxxxxxxxxxx>
>> Date:   Fri May 19 20:06:31 2023 +1000
>>
>>     MDEV-22534 Decorrelate IN subquery
>>     
>
> First, general questions:
>
> Q1: It seems there is some overlap between
> Item_in_subselect::exists2in_processor() and 
> Item_exists_subselect::exists2in_processor() Did you consider factoring
> out common code rather than copying it? (if yes, I'd like to know why
> it couldn't be done).

Indeed Item_in_subselect::exists2in_processor() is modeled after
Item_exists_subselect::exists2in_processor(). The issue with the latter
is that it is a large function that could have grown over time and I
want to build Item_in_subselect::exists2in_processor() with only what is
necessary. I will try to factor out common code. Also please take a look
at the added test file in the patch - I tried to come up with tests for
every line, including the skip condition involving upper_not mentioned
below.

>
> Q2: Where is the coe that removes the correlated equalities from the 
> subquery's WHERE? (I assume Item_exists_subselect code does it?)

Like the processor for `exists`, I replaces the correlated equalities in
`where` with int item 1 rather than removing them. Is it better to
remove them?

>
>> diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc
>> index 9e6c205ca76..8d893794da3 100644
>> --- a/sql/item_subselect.cc
>> +++ b/sql/item_subselect.cc
>> @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds,
>>    return TRUE;
>>  }
>>  
>> +/* Checks whether item tree intersects with the free list */
>> +static bool intersects_free_list(Item *item, THD *thd)
>> +{
>> +  for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next)
>> +    if (item->walk(&Item::find_item_processor, 1, (void *) to_find))
>> +      return true;
>> +  return false;
>> +}
>> +
>
> Please try moving this function into a separate file. First name that came to
> my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea
> is that we should aim for better structure.
> (This request is only valid if the factoring out suggested above doesn't
> work out).

Sanja commented on a similar commit to MDEV-31269 that this way of
finding intersection is too costly. What do you think? Before moving
this function, either we need to agree this is an acceptable function or
figure out an acceptable way of finding free items in the item tree from
the equalities. Failing both, then I suggest (as I did in a comment[1]
on MDEV-22534), we mark this function as out of scope for the review of
MDEV-22534 and as part of MDEV-31269 which may be fixed by MDEV-30073
(see a comment[2] in MDEV-31269).

[1]
https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259406
[2]
https://jira.mariadb.org/browse/MDEV-31269?focusedCommentId=259516&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259516

>
>> +/* De-correlate where conditions in an IN subquery */
>
> This needs a bigger comment describing what the rewrite does.
>
>> +bool Item_in_subselect::exists2in_processor(void *opt_arg)
>> +{
>> +  THD *thd= (THD *)opt_arg;
>> +  SELECT_LEX *first_select= unit->first_select();
>> +  JOIN *join= first_select->join;
>> +  bool will_be_correlated;
>> +  Dynamic_array<EQ_FIELD_OUTER> eqs(PSI_INSTRUMENT_MEM, 5, 5);
>> +  List<Item> outer;
>> +  Query_arena *arena= NULL, backup;
>> +  int res= FALSE;
>> +  DBUG_ENTER("Item_in_subselect::exists2in_processor");
>> +
>
> Please move comments out of the if statement and number the conditions.
>
> For examples, see best_access_path(), large if-statement starting with
>   "Don't test table scan if it can't be better."
> or 
>   "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause"
> in Item_exists_subselect::exists2in_processor().
>
>
>> +  if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) ||
>> +      /* proceed only if I'm a toplevel IN or a toplevel NOT IN */
>> +      !(is_top_level_item() ||
>> +        (upper_not && upper_not->is_top_level_item())) ||
>
> what is the reason behind this condition?

Without these conditions, the following testcase in
`subslect_decorrelate_in.test` will fail (produce wrong results IIRC):

--8<---------------cut here---------------start------------->8---
--echo # skip transformation when neither toplevel IN nor toplevel NOT IN; testcase adapated from main.subslect4
CREATE TABLE t1 ( a INT, b INT );
INSERT INTO t1 VALUES ( 1, NULL ), ( 2, NULL );
CREATE TABLE t2 ( c INT, d INT );
INSERT INTO t2 VALUES ( NULL, 3 ), ( NULL, 4 );
# not top level, upper_not is not top level
SELECT * FROM t1 WHERE a NOT IN ( SELECT c FROM t2 where b = d ) IS NULL;
select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE;
# not top level, upper_not is null
SELECT * FROM t1 WHERE a IN ( SELECT c FROM t2 where b = d ) IS NULL;
select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE;
# not top level, upper_not is not top level
SELECT * FROM t1 WHERE a NOT IN ( SELECT c FROM t2 where b = d) IS unknown;
select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE;
# not top level, upper_not is null
SELECT * FROM t1 WHERE a IN ( SELECT c FROM t2 where b = d ) IS unknown;
select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE;

drop table t1, t2;
--8<---------------cut here---------------end--------------->8---


> If we're able to handle top-level NOT IN, why can't we handle subquery in any
> context?
>
> Can we really handle the NOT IN? (I suspect not, it will suffer from the
> semantics of IN subqueries with NULLs.. 
> I can't find the name for this, it's described e.g. here:
> https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
>
>> +      first_select->is_part_of_union() ||  /* skip if part of a union */
>> +      first_select->group_list.elements || /* skip if with group by */
>> +      first_select->with_sum_func ||       /* skip if aggregation */
>> +      join->having ||                      /* skip if with having */
>> +      !join->conds ||                      /* skip if no conds */
>> +      /* skip if left expr is a single nonscalar subselect */
>> +      (left_expr->type() == Item::SUBSELECT_ITEM &&
>> +       !left_expr->type_handler()->is_scalar_type()))
>
> The above seems to be some exotic SQL. Is it something like 
>   
>   (select col1,lol2 from t1) IN (select col3,col4 from t2 where ..)
>
> and does MariaDB really support this? Please add a comment about this.

It is covered by the following testcase - I will add a comment:

--8<---------------cut here---------------start------------->8---
--echo # check for subselect in left_expr; adapted from main.ps

create table t1 (a int, b int, c int);
create table t2 (x int, y int, z int);
create table t3 as select * from t1;
insert into t1 values (1,2,3),(4,5,6),(100,200,300),(400,500,600);
insert into t2 values (1,2,3),(7,8,9),(100,200,300),(400,500,600);
insert into t3 values (1,2,3),(11,12,13),(100,0,0),(400,500,600);

# scalar subselect - ok
let $query=
select * from t1 where (select a from t3 where t3.c=t1.c) in (select x from t2 where t1.c= t2.z);
eval explain $query;
eval $query;

# a row containing a subselect - ok
let $query=
select * from t1 where ((select a from t3 where t3.c=t1.c),b) in (select x, y from t2 where t1.c= t2.z);
eval explain $query;
eval $query;

# a single non-scalar subselect - skip transformation
let $query=
select * from t1 where (select a,b from t3 where t3.c=t1.c) in (select x,y from t2 where t1.c= t2.z);
eval explain $query;
eval $query;

drop table t1, t2, t3;
--8<---------------cut here---------------end--------------->8---

>
>> +    DBUG_RETURN(FALSE);
>> +  /* iterate over conditions, and check whether they can be moved out. */
>> +  if (find_inner_outer_equalities(&join->conds, eqs))
>> +    DBUG_RETURN(FALSE);
>> +  /* If we are in the execution of a prepared statement, check for
>> +  intersection with the free list to avoid segfault. Note that the
>> +  check for prepared statement execution is necessary, otherwise it
>> +  will likely always find intersection thus abort the
>> +  transformation.
>> +
>> +  fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is
>> +  defined. It causes CI's like amd64-debian-10-debug-embedded to
>> +  fail. Are there other ways to find out we are in the execution of a
>> +  prepared statement? */
>
> I'm looking at the code of activate_stmt_arena_if_needed(). It seems
> "stmt_arena->is_conventional()" is the check you're looking for.

That's right. I have a more up-to-date patch for MDEV-31269 that uses
this which I will adapt to this patch:

https://github.com/MariaDB/server/commit/ffba2a85948

>
>> +  if (thd->m_statement_state.m_parent_prepared_stmt)
>> +  {
>> +    for (uint i= 0; i < (uint) eqs.elements(); i++)
>> +    {
>> +      if (intersects_free_list(*eqs.at(i).eq_ref, thd))
>> +        DBUG_RETURN(FALSE);
>
> This check doesn't look good. 
> As far as I understand, it is a temporary measure until related
> re-execution bug(s) are fixed by Igor (and/or Sanja)?

sanja mentioned this is too costly. See my comment above about
`intersects_free_list()`. I think an acceptable temporary measure would
be nice, as MDEV-31269 is a segfault rather than wrong results. But I
don't know yet how to make it less costly - let me know if you have any
ideas. Otherwise I suggest this is out of scope for the review of this
patch.

Best,
Yuchen
References

Review input for: MDEV-22534 Decorrelate IN subquery
From: Sergey Petrunia, 2023-05-25