← Back to team overview

maria-developers team mailing list archive

Fun with 3-value logic (was Re: Review input for: MDEV-22534 Decorrelate IN subquery)

 

On Thu 2023-05-25 22:14:13 +0300, Sergey Petrunia wrote:
> > +  if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) ||
> > +      /* proceed only if I'm a toplevel IN or a toplevel NOT IN */
> > +      !(is_top_level_item() ||
> > +        (upper_not && upper_not->is_top_level_item())) ||
> 
> what is the reason behind this condition?
> If we're able to handle top-level NOT IN, why can't we handle subquery in any
> context?
> 
> Can we really handle the NOT IN? (I suspect not, it will suffer from the
> semantics of IN subqueries with NULLs.. 
> I can't find the name for this, it's described e.g. here:
> https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )

Thanks for the pointer. It seems to me from this post that the server
follows a standard 3-value logic (3vl) where `null op foo` evals to
the unknown value for comparison operator op. So indeed the
transformation cannot handle NOT IN without some surgery. I studied
the exists2in transformation more and 3vl and came to the conclusion
that the fix there for the same situation could help with our case
too, that is, replacing the equalities with `inner_field is not null`,
and from outside AND the whole thing with `outer_expr is not null`.

More specifically, consider the simple example of

--8<---------------cut here---------------start------------->8---
create table t1 (a1 int, b1 int);
create table t2 (a2 int, b2 int);
insert into t1 values (0, null), (1, 1);
insert into t2 values (0, 1), (1, null);
select * from t1 where a1 in (select a2 from t2 where b1 = b2);
--8<---------------cut here---------------end--------------->8---

Focusing on the where condition of the outer select, basically what we
are trying to do is (\V denotes "OR over" i.e. ⋁) transforming the lhs
to the rhs from the following identity:

\V_{i: b1 = b2i} (a1 = a2i) = \V_i (a1 = a2i and b1 = b2i)

This identity holds in classical logic, but not 3vl which adds an unknown
(abbreviated u) to t and f. In mariadb, unknown seems to be an alias
to null, but it has a different semantics so in the following I write
u as an unknown logic value (where we can use comparison with other
logical expressions) and null as a sql missing value.

The 3vl version as in my broken implementation is

\V_{i: (b1 = b2i) = t} (a1 = a2i) = \V_i (a1 = a2i and b1 = b2i) # WRONG!

This identity holds when lhs is t or u, but not f. It breaks, i.e. lhs
= f and rhs = u if and only if both of the following hold

- there exists i s.t. (b1 = b2i) = u and (a1 = a2i) <> f
- forall i s.t. (b1 = b2i) = t, (a1 = a2i) = f

One can verify that the following modified version holds

\V_{i: (b1 = b2i) = t} (a1 = a2i) = (\V_{i: b2i is not null} (a1 = a2i and b1 = b2i)) and (b1 is not null)

Translating back to the query language, we have

select * from t1 where (b1 is not null and (a1, b1) in (select a2, b2 from t2 where b2 is not null));

With this modification, theoretically we don't have to care about
whether the IN subselect is top level, however, there are two issues:

1. When the subquery is indeed top level, do the extra terms lower
   performance?
2. When the subquery is not top level, how do we substitute `this`
   which is an Item_in_subselect with the Item_func_and?

I suspect this is why the exists2in transformation only does the
modification when there is an upper_not, and the transformation is
skipped when the subquery is not toplevel EXISTS nor toplevel NOT
EXISTS.

More to follow.

Best,
Yuchen


References