maria-developers team mailing list archive

Thread
Date
Re: Support of the extended syntax for ANALYZE

To: Igor Babaev <igor@xxxxxxxxxxxx>
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Tue, 22 May 2012 19:46:11 +0300
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, Sergei Golubchik <serg@xxxxxxxxxxxx>
In-reply-to: <4FB80825.9090503@askmonty.org>
Reply-to: monty@xxxxxxxxxxxx
Hi!

>>>>> "Igor" == Igor Babaev <igor@xxxxxxxxxxxx> writes:

<cut>

>>>> ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes
>>>> ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);

<cut>

Igor> The global server option --stat-tables says now whether the statistics
Igor> tables have to be taken into account (without this option the server
Igor> ignores them altogether)

Igor> There is a system variable (global/session) that controls how the
Igor> statistics table are used by queries/statements. The variable is called
Igor> optimizer_use_stat_tables.

Igor> There are 4 possible values for this variable:
Igor> - 'never'
Igor> - 'complementary'
Igor> - 'preferably'
Igor> - 'exclusively'.

Why two variables?

It should be enough to just have one variable 'use_stat_tables' or
optimize_use_stat_tables.

Igor> The default value for the variable now is 'never'.

Igor> If we start the server with --stat-tables disabled the only meaningful
Igor> value for the variable is 'never'. Other values are ignored
Igor> Q1. Should we issue a warning with the attempt to reset this value
Igor> when --stat-tables is disabled?

Having one variable would solve the above question and make usage easier.

Igor> Let me first tell you how the setting of this variable affects
Igor> regular queries (SELECT, UPDATE etc).

Igor> If the variable is set to:
Igor> 1. case 'never':
Igor>    then no statistics from stat tables is used
Igor> 2. case 'complementary'
Igor>    then the statistical value for a particular attribute is used
Igor>    from persistent tables ONLY when there is no corresponding
Igor>    statistics from other sources
Igor> 3. case 'preferably'
Igor>    then the statistical value for a particular attribute is used
Igor>    from other sources ONLY when there is no corresponding
Igor>    statistics from persistent tables
Igor> 4. case 'exclusively'
Igor>    the only statistics from persistent tables is used, if there is none
Igor>    on some attributes no statistics is used

For what cases would we ever need 4) ?

I think this is option is more dangerous than any of the other options
because if you forget this one on, almost all queries will probably be
table scans and most things will stop working.

Having 1-3 is good enough for me.

Igor> The variable setting is also affects how ANALYZE works.
Igor> The current syntax of the ANALYZE command is

Igor> ANALYZE TABLE t1, ...;

Igor> An currently if @@optimizer_use_stat_tables is set to
Igor> 1. 'never'
Igor>    then it works as before (e.g. in 5.5)
Igor> 2. 'complementary'
Igor>    then it works as before + persistent statistics
Igor>    for table_stat and column_stat is collected
Igor> 3. 'preferably'
Igor>    then it works as before + persistent statistics
Igor>    for table_stat, column_stat and index_stat is collected
Igor> 4. 'exclusively'
Igor>    only persistent statistics for table_stat, column_stat
Igor>    and index_stat is collected

I think 4) is not needed as we can do this if you use the proposed
new analyze syntax.

Igor> Q2. Is this semantics of ANALYZE is natural?

Igor> Collection of persistent statistics with ANALYZE is now engine
Igor> independent. So it's not  a cheap and fast process.
Igor> In many cases it is enough to collect/update statistics only some
Igor> columns or even for some indexes.

Igor> That's why I suggested to extend the syntax of the analyze.
Igor> Let's choose something that is close to what SergeiG suggested.

Igor> There are some questions here.
Igor> Q3. If I need statistics for all columns and some indexes of table t1
Igor>     and everything on table t2.
Igor>     can I use:
Igor>       ANALYZE TABLE t1 COLUMN[S] ALL INDEX[ES] (idx1,...), t2; ?

For the first version, lets only do this extended syntax for a table
at a time.

It should preferably work like it works for CHECK TABLE / REPAIR.
In other words: The options should be valid for all tables.
(In this case the only thing that makes sences is to to either only do
columns or indexes or both).

Igor> Q4. Do we really need the keyword PERSISTENT here?
Igor>     If so what is its semantics?

I don't mind the words persistent. I kind of like it as it's giving
the user more information of what is going (ie, that the result will
be stored once and for all).

Igor> Q5. Isn't 'FOR' in the syntax of ANALYZE confusing in the sense that
Igor>     it commands to collect statistics for column_stat/index_stat,
Igor>     but not for table_stat? Maybe it's better to omit it?

Comparing:
ANALYZE TABLE t1 PERSISTENT COLUMNS (c,e,d) INDEXES (idx2, idx4);
ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);

The later is more natural english, but I agree that it's not clear
that the command also does table statistics.

However as ANALYZE table does generally do table statistics, it's not
totally wrong.

I tried to think of a new syntax but didn't come up with anything
better than:

ANALYZE TABLE t1 PERSISTENT INCLUDING COLUMNS (c,e,d) INDEXES (idx2, idx4);

but it's harder to remember than simple using FOR

Igor> I do not plan to extend the syntax of ANALYZE to collect/update
Igor> statistics only on some attributes of the columns (like collect
Igor> statistics on min/max values, but not on the number of distinct values).
Igor> Although I admit that collecting statistics on the number of distinct
Igor> values is more expensive I don'tthink such an extension of ANALYZE is
Igor> badly needed now.

What statistics are you collecting for columns?
I thought that the only usable value would be distinct values.
Can we somehow cheaply get an approximation for distinct values?

For example, assume we use the following algorithm:

- Set up dynamic arrays to hold up 1000 distinct values per column
- Start filling the column.
- When one array fills up, remmeber the number of rows involved and
  reset the array.
- The approximated number of distinct values could then be approximated
  as:
  total_fills * 1000 + current_number_in_array

This would be close to correct for:
- Data that is clustered
- Columns with less than 1000 distinct values
- Tables with very few similar values

The worst case scenario is if the same values repeats uniquely every
1000 rows but that is not very likely. Even then, it would not bee too
bad to get a larger value for distinct as there would probably be a
disk io involved for every value we would need to read (as the values
would be far appart on the disk and could not be read on the same fetch).

Igor> I plan to add a possibility to collect statistics from random sample
Igor> records. I plan to control the process of sampling only through system
Igor> variables.

ok.

Igor> How does all above comply with persistent statistics of InnoDB from
Igor> mysql-5.6? Almost perfectly. They store their persistent statistics
Igor> in mysql.innodb_table_stat and mysql.innodb_index_stat. So we do not
Igor> conflict here. The collection and the usage of their persistent
Igor> statistics is going through their own system variables. Again - no
Igor> conflicts.

Igor> I will appreciate any comments on this post and, of course,
Igor> I would like to have some answers for the above questions from
Igor>  SergeiG.

Hope the above answers most/all of your questions?

Don't worry about the syntax. We can always fix that later when we got
some feedback from users!

Regards,
Monty
References

Re: Support of the extended syntax for ANALYZE
From: Sergei Golubchik, 2012-05-19
Re: Support of the extended syntax for ANALYZE
From: Igor Babaev, 2012-05-19
Re: Support of the extended syntax for ANALYZE
From: Sergei Golubchik, 2012-05-19
Re: Support of the extended syntax for ANALYZE
From: Igor Babaev, 2012-05-19