← Back to team overview

maria-developers team mailing list archive

Movable conditions and semi-joins

 

Hi!

The below is an extension of the skype discussion. I wrote down the limitation
imposed by FirstMatch, and also covered other Semi-join strategies.

<contents>
1. SJ-Materialization
2. FirstMatch
3. Duplicate Elimination
4. Loose Scan
</contents>

== 1. SJ-Materialization ==

The general rule is:

  If the condition is originally attached inside an SJM nest, one should 
  not move it outside.  Conditions that are originally outside should not be
  moved inisde the SJM nest.

The reason for the rule is that field values for inner tables are only
available inside the SJM nest.  Correspondingly, values of fields of outer
tables are not available at materialization step.

The only possible exception is when the condition depends only on columns that
are in the IN-equality. Then, it could be moved. I don't think this is worth
handling for the first milestone.

== 2. FirstMatch ==

Consider a query

  select ...
  from ot1, nt1, nt2 
  where expr(ot1) in (select ... from it1, it2, ... where ...)

and a join order

   ot1 FirstMatch(it1, it2)  nt1  nt2


It is possible to move cond(otX) to table itX or ntX. 

There is a problem with moving condition attached to itX (will call it
SUBQ_COND. this ocndition depends on table itX and maybe on ot1) to table ntX.

Consider join execution execution:
   ot1.row0
   ot1.row1 -+- it1.row1 -- it2.row1 
             |
             +- it1.row2 -- it2.row2 ->- (F1)-->-- nt1.row1
                                         (F2)--<--

FirstMatch check is at point (F2). At that point, we know that ot1.row1 
has had a match in the subquery, and the execution returns to table ot1.

If SUBQ_COND is moved to table ntX, then there may be a situation where 
we will get to point (F2) and assume that the subquery had a match, while
actually it hasn't.

A suggestion by Igor: at point (F2) we should check the last return value
of SUBQ_COND (actually, of all conditions that were moved like SUBQ_COND was)

It is actually more complicated: what if there is COND(ot1, nt1) such that
COND(ot1.ro1, nt1.row1) = FALSE?  Then, we will never evaluate 
SUBQ_COND(ot1.row1, ...). 
What if never evaluate SUBQ_COND(ot1.row1, ...) but before we have evaluated
SUBQ_COND(ot1.row0, ...)?  We will have to track what we have evaluated for 
the current row of ot1.row1.

== 3. Duplicate Elimination ==

Duplicate Elimination removes duplicate matches using a temporary table.
In order to work correctly, filtering needs to be performed before the weedout
is done.

Let's again use an example.

  select ...
  from ot1, nt1, nt2 
  where expr(ot1) in (select ... from it1, it2, ... where ...)

and a join order

   (DUPS-W-START) it1 ot1 it2  (DUPS-W-WEEDOUT) nt1 nt2 


at DUPS-W-START, all rows are deleted from the temp. table.
at DUPS-W-WEEDOUT, we take ot1.rowid and attempt to write it into the temp
table.

Basic idea: If a condition SUBQ_COND(itX, ...)  is moved to the right across
the (DUPS-W-WEEDOUT) point, wrong query results will occur.

Example: Suppose, SUBQ_COND(it2) is moved to nt1, and join execution
proceeds as follows:

  it1.row1 --- ot1.row1 --- it2.row1 -- DUPS-W-WEEDOUT -- nt1
    |

- We reach DUPS-W-WEEDOUT with (it1.row1, ot1.row1, it2.row1).  
- The temp. table is empty, so rowid(ot1.row1) is written into temp.table, 
  and Weedout doesn't weed it out.
- At nt1,  SUBQ_COND(it2.row1) is evaluated and found to be false. 
  join execution returns back.
- Suppose, tables  it2 and ot1 have no more matching rows, so execution returns
  to it1.

(the same diagram with a few more steps)

  it1.row1 --- ot1.row1 --- it2.row1 -- DUPS-W-WEEDOUT -- nt1
    |
  it1.row2 --- ot1.row1 --- it2.row2 -- DUPS-W-WEEDOUT


- We reach DUPS-W-WEEDOUT with (it1.row2, ot1.tow1, it2.rowX).
- The temptable already has rowid(ot1.row1). The record combination is filtered
  out. if SUBQ_COND(it2.row2)=TRUE, then the result will miss a row.

== 4. Loose Scan ==

Loose Scan is similar to First Match - it does short-cutting (but also makes
forward jumps on the driving table).

Short-cutting imposes a limitation similar to FirstMatch - conditions that
depend on subquery's tables cannot be moved out of LooseScan's range.

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB | Skype: sergefp | Blog: http://s.petrunia.net/blog