maria-developers team mailing list archive

Thread
Date
Cost analysis: check if optimizer is able to produce a correct E(#scalar_subquery executions)

To: timour@xxxxxxxxxxxx
From: Sergei Petrunia <psergey@xxxxxxxxxxxx>
Date: Thu, 9 Aug 2012 15:02:08 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.18 (2008-05-17)
Let's see if the optimizer is theoretically able to attach the subquery to the
right table for the FirstMatch plan:

On Thu, Aug 09, 2012 at 12:39:00PM +0400, Sergei Petrunia wrote:
> == Modified query ==
... 
> 
> == Benchmarking ==
> 
> Number of times scalar-subquery was executed:
>   Materialization: 9552 times   (same as before) 
>   First-Match:  299 times     (A LOT LESS!)
> 
> Execution times: 
>   Materialization: 5.30 sec   (same as before)
>   First-Match: 1.72 sec       (a lot less, used to be 15 sec! Now it beats
>                                materialization)
> 
Let's check whether the optimizer has sufficient input data to conclude that
moving the scalar-subquery from table partsupp to table part is advantageous.

The data about extra selectivity is basically tab->quick_condition_rows.

Let's take the EXPLAIN and its analysis from the "Cost analysis: optimizer 
statistics vs real dataset properties" email.  Relevant lines will start with
'NOTE>'.

+--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+
|id|select_type  |table   |type  ||key                |key_len|ref                                    |rows|filtered|Extra                                                    |
+--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+
| 1|PRIMARY      |nation  |ref   ||n_name             |26     |const                                  |   1|  100.00|Using where; Using index; Using temporary; Using filesort|
| 1|PRIMARY      |supplier|ref   ||i_s_nationkey      |5      |nation.n_nationkey                     | 251|  100.00|                                                         |
| 1|PRIMARY      |partsupp|ref   ||i_ps_suppkey       |4      |supplier.s_suppkey                     |  34|  100.00|Using where                                              |
| 1|PRIMARY      |part    |eq_ref||PRIMARY            |4      |partsupp.ps_partkey                    |   1|  100.00|Using where; FirstMatch(supplier)                        |
| 4|DEP. SUBQUERY|lineitem|ref   ||i_l_suppkey_partkey|10     |partsupp.ps_partkey,partsupp.ps_suppkey|   3|  100.00|Using where                                              |
+--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+
5 rows in set, 3 warnings (0.01 sec)

=== nation: ===
like above, 1 row. OK.
NOTE> - "using where" won't filter anything out.

=== supplier: ===
see above, 251 rows,  ~ok   412 real...
NOTE> - "using where" won't filter anything out.

=== partsupp: ===
- 800K rows and 10K distinct ps_suppkey, which gives rec_per_key=80 (EXPLAIN shows 34)

- as for data that we will hit:

    select count(*) from nation, supplier, partsupp
    where s_suppkey=ps_suppkey and s_nationkey=n_nationkey and n_name='canada';

    gives 32690 rows. 32690 / 412 = 80, matches rec_per_key.

NOTE> "using where" won't filter anything out (NOT TAKING the scalar-subquery into account)

=== part: ===
- eq_ref, so exactly 1 match. We know DBT-3 dataset is such that it always has one.

NOTE> "using where" is "p_name like forest"! it will filter stuff out!

I haven't checked, but I suppose that condition "p_name like forest%'(*) is not 
correlated with the table access condition, 'p_partkey=partsupp_ps_partkey'.

The email 'Cost analysis: Materialization plan' shows a range access on (*),
with the estimate rows=2378.
Total number of rows in table part is is 200K, InnoDB's estimate is 200999.
This gives selectivity of 0.0118.

=== How many times scalar-subquery will be evaluated ===
Let's take estimate numbers:
1 * 251 * 34 * 1 * 0.0118 = 100.7012 evaluations. 

If the estimates were perfect:

1 * 400 * 80 * 1  * 0.118 = 377.6 evaluations

As mentioned at the top of this email, in reality subquery is evaluated 299
times.

=== Conclusion#1 ===

quick_condition estimates provide information that will allow the optimizer to
make a correct conclusion about # of times that scalar-subquery will be
evaluated after table part.

What remains to be checked: check that our cost formulas will make the plan of 

  { FirstMatch, scalar-subquery attached to table part} 

cheaper than

  { FirstMatch, scalar-subquery attached to table partsupp} 

and cheaper than the SJ-Materialization plan.

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog