← Back to team overview

maria-developers team mailing list archive

Updated (by Igor): Subquery optimization: Avoid recalculating subquery if external fields values found in subquery cache (66)

 

-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subquery optimization: Avoid recalculating subquery if external fields
		values found in subquery cache
CREATION DATE..: Wed, 25 Nov 2009, 22:25
SUPERVISOR.....: Monty
IMPLEMENTOR....: Sanja
COPIES TO......: 
CATEGORY.......: Client-Sprint
TASK ID........: 66 (http://askmonty.org/worklog/?tid=66)
VERSION........: 9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Igor - Wed, 10 Mar 2010, 21:29)=-=-
High Level Description modified.
--- /tmp/wklog.66.old.32188     2010-03-10 21:29:16.000000000 +0000
+++ /tmp/wklog.66.new.32188     2010-03-10 21:29:16.000000000 +0000
@@ -1,3 +1,10 @@
+The goal of this task is to optimize evaluation of subqueries and subquery
+predicates by storing the results of a correlated subquery together with
+correlation parameters in a cache and reusing those results for the same sets of
+parameters.
+
+Here's what is to be done in this task in more details:
+
 Collect all outer items/references (left part of the subquiery and outer
 references inside the subquery) in key string. Compare the string (which
 represents certain value set of the references) against values in hash table and

-=-=(Igor - Wed, 10 Mar 2010, 21:13)=-=-
Dependency created: 91 now depends on 66

-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Category updated.
--- /tmp/wklog.66.old.31558     2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558     2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Client-Sprint

-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Version updated.
--- /tmp/wklog.66.old.31558     2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558     2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-5.3
+9.x

-=-=(Monty - Fri, 29 Jan 2010, 19:07)=-=-
Version updated.
--- /tmp/wklog.66.old.5893      2010-01-29 19:07:10.000000000 +0200
+++ /tmp/wklog.66.new.5893      2010-01-29 19:07:10.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3

-=-=(Psergey - Wed, 20 Jan 2010, 14:50)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.26873     2010-01-20 14:50:41.000000000 +0200
+++ /tmp/wklog.66.new.26873     2010-01-20 14:50:41.000000000 +0200
@@ -4,7 +4,6 @@
 
 To check/discuss:
 -----------------
-* Do we put subquery cache on all levels of subqueries or on highest level only
 * Will there be any means to measure subquery cache hit rate?
 * MySQL-6.0 has a one-element predicate result cache. It is called "left
   expression cache", grep for left_expr_cache in sql/item_subselect.*  
@@ -41,7 +40,12 @@
 - subquery_item_result is 'bool' for subquery predicates, and is of
 some scalar or  ROW(scalar1,...scalarN) type for scalar-context subquery.
 
-We dont support cases when outer_expr or correlation_references are blobs.
+We don't support cases when outer_expr or correlation_references are blobs.
+
+All subquery predicates are cached. That is, if one subquery predicate is
+located within another, both of them will have caches  (one option to reduce
+cache memory usage was to use cache only for the upper-most select. we decided
+against it).
 
 2. Data structure used for the cache
 ------------------------------------

-=-=(Psergey - Wed, 20 Jan 2010, 13:07)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.17649     2010-01-20 13:07:07.000000000 +0200
+++ /tmp/wklog.66.new.17649     2010-01-20 13:07:07.000000000 +0200
@@ -3,7 +3,13 @@
 
 
 To check/discuss:
- To put subquery cache on all levels of subqueries or on highest level only.
+-----------------
+* Do we put subquery cache on all levels of subqueries or on highest level only
+* Will there be any means to measure subquery cache hit rate?
+* MySQL-6.0 has a one-element predicate result cache. It is called "left
+  expression cache", grep for left_expr_cache in sql/item_subselect.*  
+  When this WL is merged with 6.0's optimizations, these two caches will 
+  need to be unified somehow.
 
 
 <contents>

-=-=(Psergey - Mon, 18 Jan 2010, 16:40)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.24899     2010-01-18 16:40:16.000000000 +0200
+++ /tmp/wklog.66.new.24899     2010-01-18 16:40:16.000000000 +0200
@@ -1,3 +1,5 @@
+* Target version: base on mysql-5.2 code
+
 All items on which subquery depend could be collected in
 st_select_lex::mark_as_dependent (direct of indirect reference?)
 

-=-=(Psergey - Mon, 18 Jan 2010, 16:37)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.24586     2010-01-18 16:37:07.000000000 +0200
+++ /tmp/wklog.66.new.24586     2010-01-18 16:37:07.000000000 +0200
@@ -4,6 +4,11 @@
 Temporary table index should be created by all fields except result field
 (TMP_TABLE_PARAM::keyinfo).
 
+How to fill the temptable
+-------------------------
+Can reuse approach from SJ-Materialization. Its code is in end_sj_materialize()
+and is supposed to be quite trivial.
+
 How to make lookups into temptable
 ----------------------------------
 We'll reuse approach used by SJ-Materialization in 6.0. 

-=-=(Psergey - Mon, 18 Jan 2010, 16:34)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.24328     2010-01-18 16:34:19.000000000 +0200
+++ /tmp/wklog.66.new.24328     2010-01-18 16:34:19.000000000 +0200
@@ -32,8 +32,8 @@
 Question: or perhaps that is not necessarry?
 </questionable>
 
-Execution process
-~~~~~~~~~~~~~~~~~
+Doing the lookup
+~~~~~~~~~~~~~~~~
 SJ-Materialization does lookup in sub_select_sjm(), with this code:
 
     /* Do index lookup in the materialized table */
@@ -42,4 +42,12 @@
     if (res || !sjm->in_equality->val_int())
       DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
 
+The code in this WL will use the same approach
 
+Extracting the value of the subquery predicate
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The goal of making the lookup is to get the value of subquery predicate.
+This is done by creating an Item_field $I which refers to appropriate
+temporary table's field and then subquery_predicate->val_int() will invoke
+$I->val_int(),  subquery_predicate->val_str() will invoke $I->val_str() and so 
+forth.

	------------------------------------------------------------

		-=-=(View All Progress Notes, 17 total)=-=-
	http://askmonty.org/worklog/index.pl?tid=66&nolimit=1


DESCRIPTION:

The goal of this task is to optimize evaluation of subqueries and subquery
predicates by storing the results of a correlated subquery together with
correlation parameters in a cache and reusing those results for the same sets of
parameters.

Here's what is to be done in this task in more details:

Collect all outer items/references (left part of the subquiery and outer
references inside the subquery) in key string. Compare the string (which
represents certain value set of the references) against values in hash table and
return cached result of subquery if the reference values combination has already
been used.

For example in the following subquery:
(L1, L2) IN (SELECT A, B FROM T WHERE T.F1>OTER_FIELD)
set of references to look into the subquery cache is (L1, L2, OTER_FIELD).

The subquery cache should be implemented as simple LRU connected to the subquery.

Size of the subquery cache (in number of results (but maybe in used memory
amount)) is limited by session variable (query parameter?).


HIGH-LEVEL SPECIFICATION:



Attach subquery cache to each Item_subquery. Interface should allow to use hash
or temporary table inside.


To check/discuss:
-----------------
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
  expression cache", grep for left_expr_cache in sql/item_subselect.*  
  When this WL is merged with 6.0's optimizations, these two caches will 
  need to be unified somehow.


<contents>
1. Scope of the task
2. Data structure used for the cache
3. Cache size
4. Interplay with other subquery optimizations
5. User interface
</contents>

1. Scope of the task
--------------------
This WL should handle all subquery predicates, i.e. it should handle these
cases:

  outer_expr IN (SELECT correlated_select)
  outer_expr $CMP$ ALL/ANY (SELECT correlated_select)
  EXISTS (SELECT correlated_select)
  scalar-context subquery: (SELECT correlated_select)

The cache will maintain 
  
  (outer_expr, correlation_references)-> subquery_item_result 

mapping, where
- correlation_references is a list of tablename.column_name that are referred 
  from the correlated_select but tablename is a table that is ouside the
  subquery. 
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or  ROW(scalar1,...scalarN) type for scalar-context subquery.

We don't support cases when outer_expr or correlation_references are blobs.

All subquery predicates are cached. That is, if one subquery predicate is
located within another, both of them will have caches  (one option to reduce
cache memory usage was to use cache only for the upper-most select. we decided
against it).

2. Data structure used for the cache
------------------------------------
There are two data structures available in the codebase that will allow fast 
equality lookups:

1. HASH (mysys/hash.c) tables
2. Temporary tables (the ones that are used for e.g. GROUP BY)

None of them has any support for element eviction on overflow (using LRU or
some other policy). 

Query cache and MyISAM/Maria's key/page cache ought to support some eviction 
mechanism, but code-wise it is not readily reusable, one will need to factor 
it out (or copy it).

We choose to use #2, and not to have any eviction policy. See subsequent
sections for details and reasoning behind the decision.

3. Cache size
-------------
Typically, a cache has some maximum size and a policy which is used to
select a cache entry for removal when the cache becomes full (e.g. find 
and remove the least [recently] used entry)

For this WL entry we will use a cache of infinite size. The reasoning behind
this is that:
- is is easy to do: we have temporary tables that can grow to arbitrarily
  large size while still providing the same insert/lookup interface.
- it suits us: unless the subquery is resolved with one index lookup,
  hitting the cache would be many times cheaper than re-running the
  subquery, so cache is worth having.

4. Interplay with other subquery optimizations
----------------------------------------------
* This WL entry should not care about IN->EXISTS transformation: caching for 
  IN subquery and result of its conversion to EXISTS would work in the same
  way.

* This optimization is orthogonal to <=>ANY -> MIN/MAX rewrite (it will
  work/be useful irrespectively of whether the rewrite has been performed or
  not)

* TODO: compare this with materialization for uncorrelated IN-subqueries. Is 
  this basically the same?
  A: no, it is not:
  - IN-Materialization has to perform full materialization before it can 
    do the first subquery evaluation. This WL's code has almost no startup 
    costs.
  - This optimization has temp.table of (corr_reference, predicate_value),
    while IN-materialization will have (corr_reference) only.

5. User interface
-----------------
* There will be an @@optimizer_switch flag to turn this optimization on and
  off (TODO: name of the flag?)

* TODO: how do we show this in EXPLAIN [EXTENDED]?  The most easiest is to 
  print something in the warning text of EXPLAIN EXTEDED that would indicate 
  use of cache.

* temporary table sizing (max size for heap table, whether to use MyISAM or
  Maria) will be controlled with common temp.table control variables.


LOW-LEVEL DESIGN:



* Target version: base on mysql-5.2 code

All items on which subquery depend could be collected in
st_select_lex::mark_as_dependent (direct of indirect reference?)

Temporary table index should be created by all fields except result field
(TMP_TABLE_PARAM::keyinfo).

How to fill the temptable
-------------------------
Can reuse approach from SJ-Materialization. Its code is in end_sj_materialize()
and is supposed to be quite trivial.

How to make lookups into temptable
----------------------------------
We'll reuse approach used by SJ-Materialization in 6.0. 

Setup process
~~~~~~~~~~~~~
Setup is performed in the same way as in setup_sj_materialization(),
see the code that starts these lines:

    /*
      Create/initialize everything we will need to index lookups into the
      temptable.
    */

and ends at this line:

      Remove the injected semi-join IN-equalities from join_tab conds. This

<questionable>
We'll also need to check equalities, i.e. do an equivalent of this:

    if (!(sjm->in_equality= create_subq_in_equalities(thd, sjm,
                                                      emb_sj_nest->sj_subq_pred)))
      DBUG_RETURN(TRUE); /* purecov: inspected */

Question: or perhaps that is not necessarry?
</questionable>

Doing the lookup
~~~~~~~~~~~~~~~~
SJ-Materialization does lookup in sub_select_sjm(), with this code:

    /* Do index lookup in the materialized table */
    if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
      DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    if (res || !sjm->in_equality->val_int())
      DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);

The code in this WL will use the same approach

Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The goal of making the lookup is to get the value of subquery predicate.
This is done by creating an Item_field $I which refers to appropriate
temporary table's field and then subquery_predicate->val_int() will invoke
$I->val_int(),  subquery_predicate->val_str() will invoke $I->val_str() and so 
forth.


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)