maria-developers team mailing list archive

Thread
Date
Updated (by Guest): Table elimination (17)

To: sergey@xxxxxxxxxxxx
From: worklog-noreply@xxxxxxxxxxxx
Date: Wed, 28 Oct 2009 02:01:58 +0200 (EET)
Delivered-to: worklog-db@xxxxxxxxxxxx
-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......: 
CATEGORY.......: Server-Sprint
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3

PROGRESS NOTES:

-=-=(Guest - Wed, 28 Oct 2009, 02:01)=-=-
Version updated.
--- /tmp/wklog.17.old.24041     2009-10-28 02:01:58.000000000 +0200
+++ /tmp/wklog.17.new.24041     2009-10-28 02:01:58.000000000 +0200
@@ -1 +1 @@
-9.x
+Server-9.x

-=-=(Guest - Sun, 16 Aug 2009, 16:16)=-=-
Category updated.
--- /tmp/wklog.17.old.24882     2009-08-16 16:16:49.000000000 +0300
+++ /tmp/wklog.17.new.24882     2009-08-16 16:16:49.000000000 +0300
@@ -1 +1 @@
-Client-BackLog
+Server-Sprint

-=-=(Guest - Sun, 16 Aug 2009, 16:16)=-=-
Version updated.
--- /tmp/wklog.17.old.24882     2009-08-16 16:16:49.000000000 +0300
+++ /tmp/wklog.17.new.24882     2009-08-16 16:16:49.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x

-=-=(Guest - Wed, 29 Jul 2009, 21:41)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.26011     2009-07-29 21:41:04.000000000 +0300
+++ /tmp/wklog.17.new.26011     2009-07-29 21:41:04.000000000 +0300
@@ -2,163 +2,146 @@
 ~maria-captains/maria/maria-5.1-table-elimination tree.
 
 <contents>
-1. Conditions for removal
-1.1 Quick check if there are candidates
-2. Removal operation properties
-3. Removal operation
-4. User interface
-5. Tests and benchmarks
-6. Todo, issues to resolve
-6.1 To resolve
-6.2 Resolved
-7. Additional issues
+1. Elimination criteria
+2. No outside references check
+2.1 Quick check if there are tables with no outside references
+3. One-match check
+3.1 Functional dependency source #1: Potential eq_ref access
+3.2 Functional dependency source #2: col2=func(col1)
+3.3 Functional dependency source #3: One or zero records in the table
+3.4 Functional dependency check implementation
+3.4.1 Equality collection: Option1
+3.4.2 Equality collection: Option2
+3.4.3 Functional dependency propagation - option 1
+3.4.4 Functional dependency propagation - option 2
+4. Removal operation properties
+5. Removal operation
+6. User interface
+6.1 @@optimizer_switch flag
+6.2 EXPLAIN [EXTENDED]
+7. Miscellaneous adjustments
+7.1 Fix used_tables() of aggregate functions
+7.2 Make subquery predicates collect their outer references
+8. Other concerns
+8.1 Relationship with outer->inner joins converter
+8.2 Relationship with prepared statements
+8.3 Relationship with constant table detection
+9. Tests and benchmarks
 </contents>
 
 It's not really about elimination of tables, it's about elimination of inner
 sides of outer joins.
 
-1. Conditions for removal
--------------------------
-We can eliminate an inner side of outer join if:
-1. For each record combination of outer tables, it will always produce
-   exactly one record.
-2. There are no references to columns of the inner tables anywhere else in
+1. Elimination criteria
+=======================
+We can eliminate inner side of an outer join nest if:
+
+1. There are no references to columns of the inner tables anywhere else in 
    the query.
+2. For each record combination of outer tables, it will always produce
+   exactly one matching record combination.
+
+Most of effort in this WL entry is checking these two conditions.
 
-#1 means that every table inside the outer join nest is:
- - is a constant table:
-    = because it can be accessed via eq_ref(const) access, or
-    = it is a zero-rows or one-row MyISAM-like table  [MARK1]
- - has an eq_ref access method candidate.
-
-#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
-  GROUP BY and HAVING do not refer to the inner tables of the outer join
-  nest.
-
-1.1 Quick check if there are candidates
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Before we start to enumerate join nests, here is a quick way to check if
-there *can be* something to be removed:
+2. No outside references check
+==============================
+Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
+outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
+tables of the outer join nest we're trying to remove.
+
+For multi-table UPDATE/DELETE we also must not remove tables that we're
+updating/deleting from or tables that are used in UPDATE's SET clause.
+
+2.1 Quick check if there are tables with no outside references
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Before we start searching for outer join nests that could be eliminated, 
+we'll do a quick and cheap check if there possibly could be something that
+could be eliminated:
 
-  if ((tables used in select_list |  
+  if (there are outer joins && 
+      (tables used in select_list |  
        tables used in group/order by UNION |
-       tables used in where) != bitmap_of_all_tables)
+       tables used in where) != bitmap_of_all_join_tables)
   {
      attempt table elimination;
   }
 
-2. Removal operation properties
--------------------------------
-* There is always one way to remove (no choice to remove either this or that)
-* It is always better to remove as much tables as possible (at least within 
-  our cost model).
-Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
 
-3. Removal operation
---------------------
-* Remove the outer join nest's nested join structure (i.e. get the
-  outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
-  $OJ->embedding->nested_join. Update table_map's of all ancestor nested
-  joins).  [MARK2]
+3. One-match check
+==================
+We can eliminate inner side of outer join if it will always generate exactly
+one matching record combination.
 
-* Move the tables and their JOIN_TABs to front like it is done with const
-  tables, with exception that if eliminated outer join nest was within
-  another outer join nest, that shouldn't prevent us from moving away the
-  eliminated tables. 
+By definition of OUTER JOIN, a NULL-complemented record combination will be
+generated when the inner side of outer join has not produced any matches.
 
-* Update join->table_count and all-join-tables bitmap.
+What remains to be checked is that there is no possiblity that inner side of
+the outer join could produce more than one matching record combination.
 
-* That's it. Nothing else?
+We'll refer to one-match property as "functional dependency":
 
-4. User interface
------------------
-* We'll add an @@optimizer switch flag for table elimination. Tentative
-  name: 'table_elimination'.
-  (Note ^^ utility of the above questioned ^, as table elimination can never 
-   be worse than no elimination. We're leaning towards not adding the flag)
-
-* EXPLAIN will not show the removed tables at all. This will allow to check
-  if tables were removed, and also will behave nicely with anchor model and
-  VIEWs: stuff that user doesn't care about just won't be there.
+- A outer join nest is functionally dependent [wrt outer tables] if it will
+  produce one matching record combination per each record combination of
+  outer tables
 
-5. Tests and benchmarks
------------------------
-Create a benchmark in sql-bench which checks if the DBMS has table
-elimination.
-[According to Monty] Run
- - queries that would use elimination
- - queries that are very similar to one above (so that they would have same
-   QEP, execution cost, etc) but cannot use table elimination.
-then compare run times and make a conclusion about whether dbms supports table
-elimination.
+- A table is functionally dependent wrt certain set of dependency tables, if
+  record combination of dependency tables uniquely identifies zero or one
+  matching record in the table
 
-6. Todo, issues to resolve
---------------------------
+- Definitions of functional dependency of keys (=column tuples) and columns are
+  apparent.
 
-6.1 To resolve
-~~~~~~~~~~~~~~
-- Relationship with prepared statements. 
-  On one hand, it's natural to desire to make table elimination a
-  once-per-statement operation, like outer->inner join conversion. We'll have
-  to limit the applicability by removing [MARK1] as that can change during
-  lifetime of the statement.
-
-  The other option is to do table elimination every time. This will require to
-  rework operation [MARK2] to be undoable.
-  
-  I'm leaning towards doing the former. With anchor modeling, it is unlikely
-  that we'll meet outer joins which have N inner tables of which some are 1-row
-  MyISAM tables that do not have primary key.
-
-6.2 Resolved
-~~~~~~~~~~~~
-* outer->inner join conversion is not a problem for table elimination. 
-  We make outer->inner conversions based on predicates in WHERE. If the WHERE
-  referred to an inner table (requirement for OJ->IJ conversion) then table
-  elimination would not be applicable anyway.
-
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
-  - affected tables must not be eliminated
-  - tables that are used on the right side of the SET x=y assignments must 
-    not be eliminated either.
+Our goal is to prove that the entire join nest is functionally-dependent.
 
-* Aggregate functions used to report that they depend on all tables, that is,
+Join nest is functionally dependent (on the otside tables) if each of its
+elements (those can be either base tables or join nests) is functionally
+dependent.
    
-     item_agg_func->used_tables() == (1ULL << join->tables) - 1
+Functional dependency is transitive: if table A is f-dependent on the outer 
+tables and table B is f.dependent on {A, outer_tables} then B is functionally
+dependent on the outer tables.
+
+Subsequent sections list cases when we can declare a table to be
+functionally-dependent.
+
+3.1 Functional dependency source #1: Potential eq_ref access
+------------------------------------------------------------
+This is the most practically-important case. Taking the example from the HLD
+of this WL entry:
+
+  select
+    A.colA
+  from
+    tableA A
+  left outer join
+    tableB B
+  on
+    B.id = A.id;
 
-  always. Fixed it, now aggregate function reports it depends on
-  tables that its arguments depend on. In particular, COUNT(*) reports
-  that it depends on no tables (item_count_star->used_tables()==0). 
-  One consequence of that is that "item->used_tables()==0" is not 
-  equivalent to "item->const_item()==true" anymore (not sure if it's 
-  "anymore" or this has been already happening). 
-
-* EXPLAIN EXTENDED warning text was generated after the JOIN object has
-  been discarded. This didn't allow to use information about join plan 
-  when printing the warning. Fixed this by keeping the JOIN objects until 
-  we've printed the warning  (have also an intent to remove the const
-  tables from the join output).
- 
-7. Additional issues
---------------------
-* We remove ON clauses within outer join nests. If these clauses contain
-  subqueries, they probably should be gone from EXPLAIN output also?
-  Yes. Current approach: when removing an outer join nest, walk the ON clause 
-  and mark subselects as eliminated. Then let EXPLAIN code check if the 
-  SELECT was eliminated before the printing (EXPLAIN is generated by doing
-  a recursive descent, so the check will also cause children of eliminated
-  selects not to be printed)
-
-* Table elimination is performed after constant table detection (but before
-  the range analysis). Constant tables are technically different from 
-  eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
-  Considering we've already done the join_read_const_table() call, is there any
-  real difference between constant table and eliminated one? If there is, should
-  we mark const tables also as eliminated?
-  from user/EXPLAIN point of view: no. constant table is the one that we read
-  one record from. eliminated table is the one that we don't acccess at all.
+and generalizing it:  a table TBL is functionally-dependent if the ON 
+expression allows to construct a potential eq_ref access to table TBL that
+uses only outer or functionally-dependent tables.
+
+In other words: table TBL will have one match if the ON expression can be 
+converted into this form 
+
+    TBL.unique_key=func(one_match_tables) AND .. remainder ...
+
+(with appropriate extension for multi-part keys),  where 
+
+  one_match_tables= { 
+    tables that are not on the inner side of the outer join in question, and
+    functionally dependent tables
+  }
+
+Note that this will cover constant tables, except those that are constant because 
+they have 0/1 record or are partitioned and have no used partitions.
+
+
+3.2 Functional dependency source #2: col2=func(col1)
+----------------------------------------------------
+This comes from the second example in the HLS:
 
-* What is described above will not be able to eliminate this outer join 
   create unique index idx on tableB (id, fromDate);
   ...
   left outer join
@@ -169,32 +152,331 @@
     B.fromDate = (select max(sub.fromDate)
                   from tableB sub where sub.id = A.id);
 
-  This is because condition "B.fromDate= func(tableB)" cannot be used.
-  Reason#1: update_ref_and_keys() does not consider such conditions to 
-            be of any use (and indeed they are not usable for ref access)
-            so they are not put into KEYUSE array.
-  Reason#2: even if they were put there, we would need to be able to tell
-            between predicates like
-              B.fromDate= func(B.id)  // guarantees only one matching row as 
-                                      // B.id is already bound by B.id=A.id
-                                      // hence B.fromDate becomes bound too.
-            and
-              "B.fromDate= func(B.*)" // Can potentially have many matching
-                                      // records.
-  We need to
-  - Have update_ref_and_keys() create KEYUSE elements for such equalities
-  - Have eliminate_tables() and friends make a more accurate check.
-  The right check is to check whether all parts of a unique key are bound.
-  If we have keypartX to be bound, then t.keypartY=func(keypartX) makes 
-  keypartY to be bound.
-  The difficulty here is that correlated subquery predicate cannot tell what
-  columns it depends on (it only remembers tables). 
-  Traversing the predicate is expensive and complicated. 
-  We're leaning towards making each subquery predicate have a List<Item> with
-  items that 
-  - are in the current select
-  - and it depends on.
-  This list will be useful in certain other subquery optimizations as well, 
-  it is cheap to collect it in fix_fields() phase, so it will be collected 
-  for every subquery predicate. 
+Here it is apparent that tableB can be eliminated. It is not possible to
+construct eq_ref access to tableB, though, because for the second part of the
+primary key (fromDate column) we only got a condition in this form:
+
+   B.fromDate= func(tableB)
+
+(we write "func(tableB)" because ref optimizer can only determine which tables
+the right part of the equality depends on). 
+
+In general case, equality like this doesn't guarantee functional dependency. 
+For example, if func() == { return fromDate;}, i.e the ON expression is
+
+  ... ON B.id = A.id and B.fromDate = B.fromDate
+
+then that would allow table B to have multiple matches per record of table A. 
+
+In order to be able to distinguish between these two cases, we'll need to go 
+down to column level:
+
+- A table is functionally dependent if it has a unique key that's functionally
+  dependent
+
+- A unique key is functionally dependent when all of its columns are
+  functionally dependent
+
+- A table column is functionally dependent if the ON clause allows to extract
+  an AND-part in this form:
+  
+  tbl.column = f(functionally-dependent columns or columns of outer tables)
+
+3.3 Functional dependency source #3: One or zero records in the table
+---------------------------------------------------------------------
+A table with one or zero records cannot generate more than one matching
+record. This source is of lesser importance as one/zero-record tables are only
+MyISAM tables.
+
+3.4 Functional dependency check implementation
+----------------------------------------------
+As shown above, we need something similar to KEYUSE structures, but not
+exactly that (we need things that current ref optimizer considers unusable and
+don't need things that it considers usable).
+
+3.4.1 Equality collection: Option1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+We could 
+- extend KEYUSE structures to store all kinds of equalities we need 
+- change update_ref_and_keys() and co. to collect equalities both for ref 
+  access and for table elimination
+   = [possibly] Improve [eq_]ref access to be able to use equalities in 
+     form keypart2=func(keypart1) 
+- process the KEYUSE array both by table elimination and by ref access
+  optimizer.
+
++ This requires less effort. 
+- Code will have to be changed all over sql_select.cc
+- update_ref_and_keys() and co. already do several unrelated things. Hooking
+  up table elimination will make it even worse.
+
+3.4.2 Equality collection: Option2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Alternatively, we could process the WHERE clause totally on our own. 
++ Table elimination is standalone and easy to detach module.
+- Some code duplication with update_ref_and_keys() and co.
+
+Having got the equalities, we'll to propagate functional dependency property
+to unique keys, tables and, ultimately, join nests.
+
+3.4.3 Functional dependency propagation - option 1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Borrow the approach used in constant table detection code: 
+  
+  do 
+  {
+    converted= FALSE;
+    for each table T in join nest 
+    {
+      if (check_if_functionally_dependent(T))
+        converted= TRUE;
+    }
+  } while (converted == TRUE);
+
+  check_if_functionally_dependent(T)
+  {
+    if (T has eq_ref access based on func_dep_tables)
+      return TRUE;
+
+    Apply the same do-while loop-based approach to available equalities
+      T.column1=func(other columns)
+    to spread the set of functionally-dependent columns. The goal is to get 
+    all columns of a certain unique key to be bound.
+  }
+
+
+3.4.4 Functional dependency propagation - option 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Analyze the ON expression(s) and build a list of  
+
+  tbl.field = expr(...) 
+
+equalities. tbl here is a table that belongs to a join nest that could
+potentially be eliminated.
+
+besides those, add to the list 
+ - An element for each unique key in the table that needs to be eliminated
+ - An element for each table that needs to be eliminated
+ - An element for each join nest that can be eliminated (i.e. has no 
+   references from outside).
+
+Then, setup "reverse dependencies": each element should have pointers to
+elements that are functionally dependent on it:
+
+- "tbl.field=expr(...)" equality is functionally dependent on all fields that
+   are used in "expr(...)" (here we take into account only fields that belong 
+   to tables that can potentially be eliminated).
+- a unique key is dependent on all of its components
+- a table is dependent on all of its unique keys
+- a join nest is dependent on all tables that it contains
+
+These pointers are stored in form of one bitmap, such that:
+
+  "X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
+
+Each object also stores a number of dependencies it needs to be satisfied
+before it itself is satisfied:
+
+- "tbl.field=expr(...)" needs all its underlying fields (if a field is
+  referenced many times it is counted only once)
+
+- a unique key needs all of its key parts
+
+- a table needs only one of its unique keys
+
+- a join nest needs all of its tables
+
+(TODO: so what do we do when we've marked a table as constant? We'll need to
+update the "field=expr(....)" elements that use fields of that table. And the 
+problem is that we won't know how much to decrement from the counters of those 
+elements. 
+
+Solution#1: switch to table_map() based approach.
+Solution#2: introduce separate elements for each involved field.
+  field will depend on its table,
+  "field=expr" will depend on fields.
+)
+
+Besides the above, let each element have a pointer to another element, so that
+we can have a linked list of elements. 
+
+After the above structures have been created, we start the main algorithm.
+
+The first step is to create a list of functionally-dependent elements. We walk
+across array of dependencies and mark those elements that are already bound
+(i.e. their dependencies are satisfied). At the moment those immediately-bound
+are only "field=expr" dependencies that don't refer to any columns that are
+not bound. 
+
+The second step is the loop
+
+  while (bound_list is not empty)
+  {
+    Take the first bound element F off the list.
+    Use the bitmap to find out what other elements depended on it
+    for each such element E
+    {
+      if (E becomes bound after F is bound)
+        add E to the list;
+    }
+  }
+  
+The last step is to walk through elements that represent the join nests. Those
+that are bound can be eliminated.
+
+4. Removal operation properties
+===============================
+* There is always one way to remove (no choice to remove either this or that)
+* It is always better to remove as much tables as possible (at least within 
+  our cost model).
+Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
+
+
+5. Removal operation
+====================
+(This depends a lot on whether we make table elimination a one-off rewrite or
+conditional)
+
+At the moment table elimination is re-done for each join re-execution, hence
+the removal operation is designed not to modify any statement's permanent
+members.
+
+* Remove the outer join nest's nested join structure (i.e. get the
+  outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
+  $OJ->embedding->nested_join. Update table_map's of all ancestor nested
+  joins). [MARK2]
+
+* Move the tables and their JOIN_TABs to the front of join order, like it is
+  done with const tables, with exception that if eliminated outer join nest
+  was within another outer join nest, that shouldn't prevent us from moving
+  away the eliminated tables. 
+
+* Update join->table_count and all-join-tables bitmap.
+   ^ TODO: not true anymore ^
+
+* That's it. Nothing else?
+
+6. User interface
+=================
+
+6.1 @@optimizer_switch flag
+---------------------------
+Argument againist adding the flag:
+* It is always better to perform table elimination than not to do it.
+
+Arguments for the flag:
+* It is always theoretically possible that the new code will cause unintended
+  slowdowns.
+* Having the flag is useful for QA and comparative benchmarking.
+
+Decision so far: add the flag under #ifdef. Make the flag be present in debug
+builds.
+
+6.2 EXPLAIN [EXTENDED]
+----------------------
+There are two possible options:
+1. Show eliminated tables, like we do with const tables.
+2. Do not show eliminated tables. 
+
+We chose option 2, because:
+- the table is not accessed at all (besides locking it)
+- it is more natural for anchor model user - when he's querying an anchor-
+  and attributes view, he doesn't care about the unused attributes.
+
+EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either. 
+
+NOTE: Before this WL, the warning text was generated after all JOIN objects
+have been destroyed. This didn't allow to use information about join plan 
+when printing the warning. We've fixed this by keeping the JOIN objects until
+the warning text has been generated.
+ 
+Table elimination removes inner sides of outer join, and logically the ON
+clause is also removed. If this clause has any subqueries, they will be 
+also removed from EXPLAIN output.
+
+An exception to the above is that if we eliminate a derived table, it will
+still be shown in EXPLAIN output. This comes from the fact that the FROM
+subqueries are evaluated before table elimination is invoked.
+TODO: Is the above ok or still remove parts of FROM subqueries?
+
+7. Miscellaneous adjustments
+============================
+
+7.1 Fix used_tables() of aggregate functions
+--------------------------------------------
+Aggregate functions used to report that they depend on all tables, that is,
+   
+   item_agg_func->used_tables() == (1ULL << join->tables) - 1
+
+always. Fixed it, now aggregate function reports that it depends on the
+tables that its arguments depend on. In particular, COUNT(*) reports that it
+depends on no tables (item_count_star->used_tables()==0). One consequence of
+that is that "item->used_tables()==0" is not equivalent to
+"item->const_item()==true" anymore (not sure if it's "anymore" or this has 
+been already so for some items). 
+
+7.2 Make subquery predicates collect their outer references
+-----------------------------------------------------------
+Per-column functional dependency analysis requires us to take a 
+  
+  tbl.field = func(...)
+
+equality and tell which columns of which tables are referred from func(...)
+expression. For scalar expressions, this is accomplished by Item::walk()-based
+traversal. It should be reasonably cheap (the only practical Item that can be 
+expensive to traverse seems to be a special case of "col IN (const1,const2,
+...)". check if we traverse the long list for such items). 
+
+For correlated subqueries, traversal can be expensive, it is cheaper to make
+each subquery item have a list of its outer references. The list can be
+collected at fix_fields() stage with very little extra cost, and then it could
+be used for other optimizations.
+
+
+8. Other concerns
+=================
+
+8.1 Relationship with outer->inner joins converter
+--------------------------------------------------
+One could suspect that outer->inner join conversion could get in the way
+of table elimination by changing outer joins (which could be eliminated)
+to inner (which we will not try to eliminate). 
+This concern is not valid: we make outer->inner conversions based on
+predicates in WHERE. If the WHERE referred to an inner table (this is a
+requirement for the conversion) then table elimination would not be
+applicable anyway.
+
+8.2 Relationship with prepared statements
+-----------------------------------------
+On one hand, it's natural to desire to make table elimination a 
+once-per-statement operation, like outer->inner join conversion. We'll have
+to limit the applicability by removing [MARK1] as that can change during
+lifetime of the statement.
+
+The other option is to do table elimination every time. This will require to
+rework operation [MARK2] to be undoable.
+  
+
+8.3 Relationship with constant table detection
+----------------------------------------------
+Table elimination is performed after constant table detection (but before
+the range analysis). Constant tables are technically different from 
+eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
+Considering we've already done the join_read_const_table() call, is there any
+real difference between constant table and eliminated one? If there is, should
+we mark const tables also as eliminated?
+from user/EXPLAIN point of view: no. constant table is the one that we read
+one record from. eliminated table is the one that we don't acccess at all.
+TODO
+
+9. Tests and benchmarks
+=======================
+Create a benchmark in sql-bench which checks if the DBMS has table
+elimination.
+[According to Monty] Run
+ - query Q1 that would use elimination
+ - query Q2 that is very similar to Q1 (so that they would have same
+   QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether the used dbms 
+supports table elimination.
 

-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17

-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090     2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090     2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1

-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg

Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).

-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138     2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138     2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x

-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114     2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114     2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x

-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114     2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114     2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog

	------------------------------------------------------------

		-=-=(View All Progress Notes, 29 total)=-=-
	http://askmonty.org/worklog/index.pl?tid=17&nolimit=1


DESCRIPTION:

Eliminate not needed tables from SELECT queries..
This will speed up some views and automatically generated queries.

Example:

CREATE TABLE B (id int primary key);

select
 A.colA
from
 tableA A
left outer join
 tableB B
on
 B.id = A.id;

In this case we can remove table B and the join from the query.


HIGH-LEVEL SPECIFICATION:



Here is an extended explanation of table elimination.

Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.

Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:

select
 A.colA
from
 tableA A
left outer join
 tableB B
on
 B.id = A.id;

When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met the selected column (A.colA) will
still contain it's original value. The not seen B.* row would contain all NULL:s.

However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.

Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example and often the
case a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.

Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.

select
 A.colA
from
 tableA A
left outer join
 tableB B
on
 B.id = A.id
and
 B.fromDate = (
   select
     max(sub.fromDate)
   from
     tableB sub
   where
     sub.id = A.id
 );

In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).

Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.

SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations. Either when you have
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.


LOW-LEVEL DESIGN:



The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.

<contents>
1. Elimination criteria
2. No outside references check
2.1 Quick check if there are tables with no outside references
3. One-match check
3.1 Functional dependency source #1: Potential eq_ref access
3.2 Functional dependency source #2: col2=func(col1)
3.3 Functional dependency source #3: One or zero records in the table
3.4 Functional dependency check implementation
3.4.1 Equality collection: Option1
3.4.2 Equality collection: Option2
3.4.3 Functional dependency propagation - option 1
3.4.4 Functional dependency propagation - option 2
4. Removal operation properties
5. Removal operation
6. User interface
6.1 @@optimizer_switch flag
6.2 EXPLAIN [EXTENDED]
7. Miscellaneous adjustments
7.1 Fix used_tables() of aggregate functions
7.2 Make subquery predicates collect their outer references
8. Other concerns
8.1 Relationship with outer->inner joins converter
8.2 Relationship with prepared statements
8.3 Relationship with constant table detection
9. Tests and benchmarks
</contents>

It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.

1. Elimination criteria
=======================
We can eliminate inner side of an outer join nest if:

1. There are no references to columns of the inner tables anywhere else in 
   the query.
2. For each record combination of outer tables, it will always produce
   exactly one matching record combination.

Most of effort in this WL entry is checking these two conditions.

2. No outside references check
==============================
Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
tables of the outer join nest we're trying to remove.

For multi-table UPDATE/DELETE we also must not remove tables that we're
updating/deleting from or tables that are used in UPDATE's SET clause.

2.1 Quick check if there are tables with no outside references
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start searching for outer join nests that could be eliminated, 
we'll do a quick and cheap check if there possibly could be something that
could be eliminated:

  if (there are outer joins && 
      (tables used in select_list |  
       tables used in group/order by UNION |
       tables used in where) != bitmap_of_all_join_tables)
  {
    attempt table elimination;
  }
  

3. One-match check
==================
We can eliminate inner side of outer join if it will always generate exactly
one matching record combination.

By definition of OUTER JOIN, a NULL-complemented record combination will be
generated when the inner side of outer join has not produced any matches.

What remains to be checked is that there is no possiblity that inner side of
the outer join could produce more than one matching record combination.

We'll refer to one-match property as "functional dependency":

- A outer join nest is functionally dependent [wrt outer tables] if it will
  produce one matching record combination per each record combination of
  outer tables

- A table is functionally dependent wrt certain set of dependency tables, if
  record combination of dependency tables uniquely identifies zero or one
  matching record in the table

- Definitions of functional dependency of keys (=column tuples) and columns are
  apparent.

Our goal is to prove that the entire join nest is functionally-dependent.

Join nest is functionally dependent (on the otside tables) if each of its
elements (those can be either base tables or join nests) is functionally
dependent.

Functional dependency is transitive: if table A is f-dependent on the outer 
tables and table B is f.dependent on {A, outer_tables} then B is functionally
dependent on the outer tables.

Subsequent sections list cases when we can declare a table to be
functionally-dependent.

3.1 Functional dependency source #1: Potential eq_ref access
------------------------------------------------------------
This is the most practically-important case. Taking the example from the HLD
of this WL entry:

  select
    A.colA
  from
    tableA A
  left outer join
    tableB B
  on
    B.id = A.id;

and generalizing it:  a table TBL is functionally-dependent if the ON 
expression allows to construct a potential eq_ref access to table TBL that
uses only outer or functionally-dependent tables.

In other words: table TBL will have one match if the ON expression can be 
converted into this form 

    TBL.unique_key=func(one_match_tables) AND .. remainder ...

(with appropriate extension for multi-part keys),  where 

  one_match_tables= { 
    tables that are not on the inner side of the outer join in question, and
    functionally dependent tables
  }

Note that this will cover constant tables, except those that are constant because 
they have 0/1 record or are partitioned and have no used partitions.


3.2 Functional dependency source #2: col2=func(col1)
----------------------------------------------------
This comes from the second example in the HLS:

  create unique index idx on tableB (id, fromDate);
  ...
  left outer join
    tableB B
  on 
    B.id = A.id
  and
    B.fromDate = (select max(sub.fromDate)
                  from tableB sub where sub.id = A.id);

Here it is apparent that tableB can be eliminated. It is not possible to
construct eq_ref access to tableB, though, because for the second part of the
primary key (fromDate column) we only got a condition in this form:

   B.fromDate= func(tableB)

(we write "func(tableB)" because ref optimizer can only determine which tables
the right part of the equality depends on). 

In general case, equality like this doesn't guarantee functional dependency. 
For example, if func() == { return fromDate;}, i.e the ON expression is

  ... ON B.id = A.id and B.fromDate = B.fromDate

then that would allow table B to have multiple matches per record of table A. 

In order to be able to distinguish between these two cases, we'll need to go 
down to column level:

- A table is functionally dependent if it has a unique key that's functionally
  dependent

- A unique key is functionally dependent when all of its columns are
  functionally dependent

- A table column is functionally dependent if the ON clause allows to extract
  an AND-part in this form:
  
  tbl.column = f(functionally-dependent columns or columns of outer tables)

3.3 Functional dependency source #3: One or zero records in the table
---------------------------------------------------------------------
A table with one or zero records cannot generate more than one matching
record. This source is of lesser importance as one/zero-record tables are only
MyISAM tables.

3.4 Functional dependency check implementation
----------------------------------------------
As shown above, we need something similar to KEYUSE structures, but not
exactly that (we need things that current ref optimizer considers unusable and
don't need things that it considers usable).

3.4.1 Equality collection: Option1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could 
- extend KEYUSE structures to store all kinds of equalities we need 
- change update_ref_and_keys() and co. to collect equalities both for ref 
  access and for table elimination
   = [possibly] Improve [eq_]ref access to be able to use equalities in 
     form keypart2=func(keypart1) 
- process the KEYUSE array both by table elimination and by ref access
  optimizer.

+ This requires less effort. 
- Code will have to be changed all over sql_select.cc
- update_ref_and_keys() and co. already do several unrelated things. Hooking
  up table elimination will make it even worse.

3.4.2 Equality collection: Option2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alternatively, we could process the WHERE clause totally on our own. 
+ Table elimination is standalone and easy to detach module.
- Some code duplication with update_ref_and_keys() and co.

Having got the equalities, we'll to propagate functional dependency property
to unique keys, tables and, ultimately, join nests.

3.4.3 Functional dependency propagation - option 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Borrow the approach used in constant table detection code: 
  
  do 
  {
    converted= FALSE;
    for each table T in join nest 
    {
      if (check_if_functionally_dependent(T))
        converted= TRUE;
    }
  } while (converted == TRUE);

  check_if_functionally_dependent(T)
  {
    if (T has eq_ref access based on func_dep_tables)
      return TRUE;

    Apply the same do-while loop-based approach to available equalities
      T.column1=func(other columns)
    to spread the set of functionally-dependent columns. The goal is to get 
    all columns of a certain unique key to be bound.
  }


3.4.4 Functional dependency propagation - option 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Analyze the ON expression(s) and build a list of  

  tbl.field = expr(...) 

equalities. tbl here is a table that belongs to a join nest that could
potentially be eliminated.

besides those, add to the list 
 - An element for each unique key in the table that needs to be eliminated
 - An element for each table that needs to be eliminated
 - An element for each join nest that can be eliminated (i.e. has no 
   references from outside).

Then, setup "reverse dependencies": each element should have pointers to
elements that are functionally dependent on it:

- "tbl.field=expr(...)" equality is functionally dependent on all fields that
   are used in "expr(...)" (here we take into account only fields that belong 
   to tables that can potentially be eliminated).
- a unique key is dependent on all of its components
- a table is dependent on all of its unique keys
- a join nest is dependent on all tables that it contains

These pointers are stored in form of one bitmap, such that:

  "X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )

Each object also stores a number of dependencies it needs to be satisfied
before it itself is satisfied:

- "tbl.field=expr(...)" needs all its underlying fields (if a field is
  referenced many times it is counted only once)

- a unique key needs all of its key parts

- a table needs only one of its unique keys

- a join nest needs all of its tables

(TODO: so what do we do when we've marked a table as constant? We'll need to
update the "field=expr(....)" elements that use fields of that table. And the 
problem is that we won't know how much to decrement from the counters of those 
elements. 

Solution#1: switch to table_map() based approach.
Solution#2: introduce separate elements for each involved field.
  field will depend on its table,
  "field=expr" will depend on fields.
)

Besides the above, let each element have a pointer to another element, so that
we can have a linked list of elements. 

After the above structures have been created, we start the main algorithm.

The first step is to create a list of functionally-dependent elements. We walk
across array of dependencies and mark those elements that are already bound
(i.e. their dependencies are satisfied). At the moment those immediately-bound
are only "field=expr" dependencies that don't refer to any columns that are
not bound. 

The second step is the loop

  while (bound_list is not empty)
  {
    Take the first bound element F off the list.
    Use the bitmap to find out what other elements depended on it
    for each such element E
    {
      if (E becomes bound after F is bound)
        add E to the list;
    }
  }
  
The last step is to walk through elements that represent the join nests. Those
that are bound can be eliminated.

4. Removal operation properties
===============================
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as much tables as possible (at least within 
  our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.


5. Removal operation
====================
(This depends a lot on whether we make table elimination a one-off rewrite or
conditional)

At the moment table elimination is re-done for each join re-execution, hence
the removal operation is designed not to modify any statement's permanent
members.

* Remove the outer join nest's nested join structure (i.e. get the
  outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
  $OJ->embedding->nested_join. Update table_map's of all ancestor nested
  joins). [MARK2]

* Move the tables and their JOIN_TABs to the front of join order, like it is
  done with const tables, with exception that if eliminated outer join nest
  was within another outer join nest, that shouldn't prevent us from moving
  away the eliminated tables. 

* Update join->table_count and all-join-tables bitmap.
   ^ TODO: not true anymore ^

* That's it. Nothing else?

6. User interface
=================

6.1 @@optimizer_switch flag
---------------------------
Argument againist adding the flag:
* It is always better to perform table elimination than not to do it.

Arguments for the flag:
* It is always theoretically possible that the new code will cause unintended
  slowdowns.
* Having the flag is useful for QA and comparative benchmarking.

Decision so far: add the flag under #ifdef. Make the flag be present in debug
builds.

6.2 EXPLAIN [EXTENDED]
----------------------
There are two possible options:
1. Show eliminated tables, like we do with const tables.
2. Do not show eliminated tables. 

We chose option 2, because:
- the table is not accessed at all (besides locking it)
- it is more natural for anchor model user - when he's querying an anchor-
  and attributes view, he doesn't care about the unused attributes.

EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either. 

NOTE: Before this WL, the warning text was generated after all JOIN objects
have been destroyed. This didn't allow to use information about join plan 
when printing the warning. We've fixed this by keeping the JOIN objects until
the warning text has been generated.
 
Table elimination removes inner sides of outer join, and logically the ON
clause is also removed. If this clause has any subqueries, they will be 
also removed from EXPLAIN output.

An exception to the above is that if we eliminate a derived table, it will
still be shown in EXPLAIN output. This comes from the fact that the FROM
subqueries are evaluated before table elimination is invoked.
TODO: Is the above ok or still remove parts of FROM subqueries?

7. Miscellaneous adjustments
============================

7.1 Fix used_tables() of aggregate functions
--------------------------------------------
Aggregate functions used to report that they depend on all tables, that is,
   
   item_agg_func->used_tables() == (1ULL << join->tables) - 1

always. Fixed it, now aggregate function reports that it depends on the
tables that its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0). One consequence of
that is that "item->used_tables()==0" is not equivalent to
"item->const_item()==true" anymore (not sure if it's "anymore" or this has 
been already so for some items). 

7.2 Make subquery predicates collect their outer references
-----------------------------------------------------------
Per-column functional dependency analysis requires us to take a 
  
  tbl.field = func(...)

equality and tell which columns of which tables are referred from func(...)
expression. For scalar expressions, this is accomplished by Item::walk()-based
traversal. It should be reasonably cheap (the only practical Item that can be 
expensive to traverse seems to be a special case of "col IN (const1,const2,
...)". check if we traverse the long list for such items). 

For correlated subqueries, traversal can be expensive, it is cheaper to make
each subquery item have a list of its outer references. The list can be
collected at fix_fields() stage with very little extra cost, and then it could
be used for other optimizations.


8. Other concerns
=================

8.1 Relationship with outer->inner joins converter
--------------------------------------------------
One could suspect that outer->inner join conversion could get in the way
of table elimination by changing outer joins (which could be eliminated)
to inner (which we will not try to eliminate). 
This concern is not valid: we make outer->inner conversions based on
predicates in WHERE. If the WHERE referred to an inner table (this is a
requirement for the conversion) then table elimination would not be
applicable anyway.

8.2 Relationship with prepared statements
-----------------------------------------
On one hand, it's natural to desire to make table elimination a 
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.

The other option is to do table elimination every time. This will require to
rework operation [MARK2] to be undoable.
  

8.3 Relationship with constant table detection
----------------------------------------------
Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from 
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
TODO

9. Tests and benchmarks
=======================
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
 - query Q1 that would use elimination
 - query Q2 that is very similar to Q1 (so that they would have same
   QEP, execution cost, etc) but cannot use table elimination.
then compare run times and make a conclusion about whether the used dbms 
supports table elimination.


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)