maria-developers team mailing list archive

Thread
Date
Predictions re whether cost-based predicate pushdown at optimizer phase will cause QEP to change

To: Timour Katchaounov <timour@xxxxxxxxxxxxxxxx>
From: Sergei Petrunia <psergey@xxxxxxxxxxxx>
Date: Fri, 10 Aug 2012 03:06:43 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20120809210052.GO22964@pslp2>
User-agent: Mutt/1.5.18 (2008-05-17)
On Fri, Aug 10, 2012 at 01:00:52AM +0400, Sergei Petrunia wrote:
> Subject: Re: Cost analysis: checking if join optimization costs match the
> 	reality.
> Hi!
> 
> I've ran this experiment: I took the scalar subquery out of the query and then
> tried to compare optimizer costs with execution times. 
> 
> == The query ==
> explain extended
> select sql_calc_found_rows
>        s_name, s_address
> from nation straight_join supplier
> where s_suppkey in (select ps_suppkey from partsupp
>                     where ps_partkey in (select p_partkey from part
>                                          where p_name like 'forest%')
>                    )
> and s_nationkey = n_nationkey
> and n_name = 'CANADA'
> order by s_name
> limit 10;
> +--+-----------+--------+------++-------------+-------+-------------------+----+---------------------------------------------------------+
> |id|select_type|table   |type  ||key          |key_len|ref                |rows|Extra                                                    |
> +--+-----------+--------+------++-------------+-------+-------------------+----+---------------------------------------------------------+
> | 1|PRIMARY    |nation  |ref   ||n_name       |26     |const              |   1|Using where; Using index; Using temporary; Using filesort|
> | 1|PRIMARY    |supplier|ref   ||i_s_nationkey|5      |nation.n_nationkey | 207|                                                         |
> | 1|PRIMARY    |partsupp|ref   ||i_ps_suppkey |4      |supplier.s_suppkey |  50|Using index                                              |
> | 1|PRIMARY    |part    |eq_ref||PRIMARY      |4      |partsupp.ps_partkey|   1|Using where; FirstMatch(supplier)                        |
> +--+-----------+--------+------++-------------+-------+-------------------+----+---------------------------------------------------------+
> 
> +--+------------+-----------+------++-------------+-------+------------------+----+---------------------------------------------------------+
> |id|select_type |table      |type  ||key          |key_len|ref               |rows|Extra                                                    |
> +--+------------+-----------+------++-------------+-------+------------------+----+---------------------------------------------------------+
> | 1|PRIMARY     |nation     |ref   ||n_name       |26     |const             |   1|Using where; Using index; Using temporary; Using filesort|
> | 1|PRIMARY     |supplier   |ref   ||i_s_nationkey|5      |nation.n_nationkey| 207|                                                         |
> | 1|PRIMARY     |<subquery2>|eq_ref||distinct_key |4      |func              |   1|                                                         |
> | 2|MATERIALIZED|part       |range ||p_name       |58     |NULL              |2387|Using where; Using index                                 |
> | 2|MATERIALIZED|partsupp   |ref   ||i_ps_partkey |4      |part.p_partkey    |   2|Using index                                              |
> +--+------------+-----------+------++-------------+-------+------------------+----+---------------------------------------------------------+
> 
> 
> === Execution times ===
> FirstMatch: 0.20 - 0.30 sec.
> Materialization: 0.20 sec  - 0.30 sec
> 
> run times for both fluctuate between 0.20 and 0.30 sec, FirstMatch seems to be
> somewhat slower (dunno if the difference is statistically meaningful)
> 
> == Optimizer costs ===
> FirstMatch:      11029.4
> Materialization:  3594.5

Besides that, we know:
Cost of scalar subquery execution (in choose_plan()):
(gdb) p join->best_read
  $144 = 4.5989999999999993 = 4.6


=== Scalar subquery costs for FirstMatch ===

Number of evaluations includes the selectivity of partsupp that's not show in
EXPLAIN:

fm_n_evaluations= 1 * 207 * 50 * 1 * 0.0118 = 122
fm_total_subq_cost = 4.6 * 122 = 562

fm_adjusted_total_cost= 11029.4 + 562= 11591.4
=== Scalar subquery costs for Materialization ===

sjm_n_evaluations = 2387*1 = 2387
sjm_total_subq_cost = 2387 * 4.6 = 10980.

sjm_adjusted_total_cost= 3594.5 + 10980= 14574.5

That is,

  sjm_adjusted_total_cost > fm_total_subq_cost

which means it should choose FirstMatch!


== Difference is not very satisfying, though ==
The result is not fully satisfying though, because cost numbers are close,
while we have this data on execution:

<spetrunia> Mine: 
<spetrunia>   Materialization: 5.30 sec (3x)
<spetrunia>   First-Match, with 'fixed' scalar predicate:  1.72 sec (1x)
<spetrunia>   FirstMatch, original:  ~15 sec (8x)
<spetrunia> Timour:
<spetrunia>   FirstMatch, mdev-83: 1x
<spetrunia>   Materialization: 5x 
<spetrunia>   FirstMatch, original:  8.3x

that is "fixed FirstMatch" is 3x-5x faster than Materialization, while the costs differ by a factor of
14574.5/11591.4 = 1.25

On the other hand, see email "Cost analysis: optimizer statistics vs real dataset properties":
- supplier.rows =207 ,while in reality it's 412, precise stats gives 400.
- for Materialization plan, partsupp.rows=2, precise stats gives 4.
- for FirstMatch partsupp.rows=50, while precise stats (and data we will hit)
  shows it should be 80.

Last but not the least: that email shows that the scalar-context subquery will
scan 3 rows. 

select 
  (select count(*) from lineitem) / (select count(distinct l_partkey, l_suppkey) from lineitem);

gives 7.5, which means subquery is 2x more expensive than the optimizer thinks
it is.

=== Conclusion ===
- Precise stats will cause different cost numbers. I'm not sure what they will
  be, but there is some hope FirstMatch may still win.


BR
 Sergei
-- 
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
References

Cost analysis: checking if join optimization costs match the reality.
From: Sergei Petrunia, 2012-08-09