← Back to team overview

maria-developers team mailing list archive

Cost analysis: optimizer statistics vs real dataset properties

 

So, we have done:
alter table name add index(n_name);

and now we have:

explain extended
select sql_calc_found_rows
       s_name, s_address
from nation straight_join supplier
where s_suppkey in (select ps_suppkey from partsupp
                    where ps_partkey in (select p_partkey from part
                                         where p_name like 'forest%')
                          and ps_availqty >
                              (select 0.5 * sum(l_quantity)
                               from lineitem
                               where l_partkey = ps_partkey
                                     and l_suppkey = ps_suppkey
                                     and l_shipdate >= date('1994-01-01')
                                     and l_shipdate < date('1994-01-01') +
                                         interval '1' year ))
and s_nationkey = n_nationkey
and n_name = 'canada'
order by s_name
limit 10;
+--+------------------+-----------+------++-------------------+-------+---------------------------------------+----+---------------------------------------------------------+
|id|select_type       |table      |type  ||key                |key_len|ref                                    |rows|Extra                                                    |
+--+------------------+-----------+------++-------------------+-------+---------------------------------------+----+---------------------------------------------------------+
| 1|PRIMARY           |nation     |ref   ||n_name             |26     |const                                  |   1|Using where; Using index; Using temporary; Using filesort|
| 1|PRIMARY           |supplier   |ref   ||i_s_nationkey      |5      |nation.n_nationkey                     | 251|                                                         |
| 1|PRIMARY           |<subquery2>|eq_ref||distinct_key       |4      |func                                   |   1|                                                         |
| 2|MATERIALIZED      |part       |range ||p_name             |58     |NULL                                   |2387|Using where; Using index                                 |
| 2|MATERIALIZED      |partsupp   |ref   ||PRIMARY            |4      |part.p_partkey                         |   1|Using where                                              |
| 4|DEPENDENT SUBQUERY|lineitem   |ref   ||i_l_suppkey_partkey|10     |partsupp.ps_partkey,partsupp.ps_suppkey|   3|Using where                                              |
+--+------------------+-----------+------++-------------------+-------+---------------------------------------+----+---------------------------------------------------------+


=== nation: ===
1 row scanned, for real OK.
- "using where" doesn't filter anything out.

=== supplier ===
- dataset has 412 rows for CANADA  
- rec_per_key: 10L suppliers / 25 countries=400

=== <subquery2> == 
- eq_ref, we know that there's always match
- "using where" won't filter anything out.

=== part ===
- dataset/where match 2378 rows for (p_name like 'forest%')  OK.
- "using where" won't filter anything out.

=== partsupp ===
- rec_per_key: partsupp has 800K rows and 200K distinct ps_partkey values, which
  gives rec_per_key=4  (while EXPLAIN shows 1!)

- select count(*) from partsupp 
  where ps_partkey in (select p_partkey from part where p_name like 'forest%');
  gives 9552 rows.

(rec_per_key=4) * 2378 = 9512 , close to 9552.

- "using where" won't filter anything out (NOT TAKING scalar-subquery into account)



+--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+
|id|select_type  |table   |type  ||key                |key_len|ref                                    |rows|filtered|Extra                                                    |
+--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+
| 1|PRIMARY      |nation  |ref   ||n_name             |26     |const                                  |   1|  100.00|Using where; Using index; Using temporary; Using filesort|
| 1|PRIMARY      |supplier|ref   ||i_s_nationkey      |5      |nation.n_nationkey                     | 251|  100.00|                                                         |
| 1|PRIMARY      |partsupp|ref   ||i_ps_suppkey       |4      |supplier.s_suppkey                     |  34|  100.00|Using where                                              |
| 1|PRIMARY      |part    |eq_ref||PRIMARY            |4      |partsupp.ps_partkey                    |   1|  100.00|Using where; FirstMatch(supplier)                        |
| 4|DEP. SUBQUERY|lineitem|ref   ||i_l_suppkey_partkey|10     |partsupp.ps_partkey,partsupp.ps_suppkey|   3|  100.00|Using where                                              |
+--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+
5 rows in set, 3 warnings (0.01 sec)

=== nation: ===
like above, 1 row. OK.
- "using where" won't filter anything out.

=== supplier: ===
see above, 251 rows,  ~ok   412 real...

=== partsupp: ===
- 800K rows and 10K distinct ps_suppkey, which gives rec_per_key=80 (EXPLAIN shows 34)

- as for data that we will hit:
    
    select count(*) from nation, supplier, partsupp 
    where s_suppkey=ps_suppkey and s_nationkey=n_nationkey and n_name='canada';

    gives 32690 rows. 32690 / 412 = 80, matches rec_per_key.

- "using where" won't filter anything out (NOT TAKING scalar-subquery into account)

=== part: ===
- eq_ref, so exactly 1 match. We know DBT-3 dataset is such that it always has one.

- "using where" is "p_name like forest"! it will filter stuff out!
     
     ^^^^^^^^ but does it matter.. subquery is attached to ps? 


********* Some benchmarking *************
Execution times: 
 - FirstMatch is ~15 sec
 - Materialization:  ~5 sec.

With scalar-subquery predicate replaced with TRUE:
 - Materialization: ~0.20 sec.
 - FirstMatch: 0.19 sec.

Number of times the scalar subquery was executed:
- DEP. SUBQUERY: ... 24055 times

- MATERIALIZATION: 9555 times

If we take the cost of scalar-subquery predicate evaluation into account, 
the difference should be around:

  (24055 - 9555) * 5 = 72K

which is sufficient.
The problem is, Materialization executes the scalar-subquery fewer times than 
FirstMatch.  And you expect that taking into account scalar-subquery execution 
cost will cause Materialization not to be used??


BR
 Sergei
-- 
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog