maria-developers team mailing list archive

Thread
Date
Re: MySQL Optimizer Trace Questions

To: Christian Rober <christian@xxxxxxxxxxx>
From: Sergey Petrunia <sergey@xxxxxxxxxxx>
Date: Tue, 25 Feb 2014 15:15:24 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CAFznR5UAMao4DHq+fbt+29XgKfjRYPmOpKyvA0QQ_K01dw81Xg@mail.gmail.com>
User-agent: Mutt/1.5.21 (2010-09-15)
Hi Christian,

On Thu, Feb 20, 2014 at 06:31:45PM -0500, Christian Rober wrote:
> I work for TokuTek as a Storage Engine developer and I have been using
> the, relatively, new optimizer trace functionality.  I have found the
> results very helpful when combined with analyzing the source code, but
> there a few questions I want to ask.  Here is some output from the
> trace for a recent SELECT query on an InnoDB table:
> 
> At the end of the trace, it shows the index chosen by, what I presume,
> is the optimizer:
> 
> ...
> "considered_execution_plans":
>   [
>     {
>       "plan_prefix": [],
>       "table": "`reports` `a`",
>       "best_access_path":
>         {
>             "considered_access_paths": [
>                {
>                  "access_type": "ref",
>                  "index": "reports_group_id",
>                  "rows": 1.55e6,
>                  "cost": 411057,
>                  "chosen": true
>                },
>                ...
>              ]
>          },
>        "cost_for_plan": 411057,
>        "rows_for_plan": 1.55e6,
>        "chosen": true
>      }
>   ]
> },
> ...
> 
> Here are the cost results for some rejected alternative indexes,
> (note: the row count for two of them is similar to the chosen index):
> 
> ...
> "analyzing_range_alternatives": {
>       "range_scan_alternatives"
> ...
>           "rows": 2377102,
>           "cost": 2.81e6,
> ...
>           "rows": 1554164,
>           "cost": 1.84e6,
> ...
>           "rows": 2346664,
>           "cost": 2.78e6,
> ...
>           "rows": 1554164,
>           "cost": 1.84e6,
> ...
> 
> Here is the cost analysis for a table scan, from the beginning of the
> trace (still not as good as the above chosen index, but close on
> cost):
> 
> ...
> "table": "`reports` `a`",
> "range_analysis":
>  {
>    "table_scan":
>     {
>       "rows": 3108329,
>       "cost": 655076
>     },
> ...
> 
> Related to the fact that the table scan analysis is dominated by the
> large row count, and probably rejected as a valid plan compared to the
> chosen index for that very reason:
> 
> 1. How is the row count more than merely informational to the
> optimizer?  I had the impression the row count did not directly
> contribute to the optimizer's decision as much as the estimated cost,
> or that the row count was already accounted for/factored into in the
> final cost measurement.
> 2.  Related to that question, what is cost exactly?  Is there a unit
> of measurement (ex: Time, Disk Seeks, etc.)?

The cost is "random IO disk seeks". However, a lot of parts of the cost model
do not distinguish between random and sequential disk io, do not account for
presence of caches, etc.

Another important constant is (1/TIME_FOR_COMPARE). This is a cost of
evaluating the WHERE clause (or some part of it).

> 3.  Is cost more important than row count or equally important?

The decision which query plan should be used is made based on cost.

However, the join optimizer's cost calculations put a big penalty on query 
plans that produce lots of rows. See sql_select.cc, 
best_extension_by_limited_search():

...
      /* Compute the cost of extending the plan with 's' */

      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time +
                        current_record_count / (double) TIME_FOR_COMPARE;
...

Note the last part, current_record_count/TIME_FOR_COMPARE. 
current_record_count is "# random disk reads", so adding
current_record_count/10  puts a big penalty on query plans that return a lot of
rows.

> 4.  What is the genesis of the cost variable displayed in the trace?
> I assume it is just a guess on the respective storage engine's part of
> how expensive (in terms of disk access, processing, time, etc.)
> getting the rows will be for the query.

Cost numbers for basic operations (range scan, full table scan) come from the
storage engine. Cost numbers for more complex query plans (joins,
index_merge, etc) come from cost calculations done at the SQL layer.

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB | Skype: sergefp | Blog: http://s.petrunia.net/blog
References

MySQL Optimizer Trace Questions
From: Christian Rober, 2014-02-20