← Back to team overview

maria-developers team mailing list archive

Re: Support of the extended syntax for ANALYZE

 

Serg,
All,

On 05/19/2012 10:51 AM, Sergei Golubchik wrote:
> Hi, Igor!
> 
> On May 19, Igor Babaev wrote:
>> On 05/19/2012 04:22 AM, Sergei Golubchik wrote:
>>> On May 10, Igor Babaev wrote:
>>> Alternatively, it could have an explicit "PERSISTENT" keyword:
>>>
>>>   ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes
>>>   ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
>>
>> We usually don't use key words in plural. So maybe you meant:
>> ANALYZE TABLE t1 PERSISTENT FOR COLUMN (c,e,d) INDEX (idx2, idx4);
>> ?
> 
> I've checked before suggesting this syntax. Both COLUMNS and INDEXES
> already exist as tokens and used somewhere in the grammar.
> 
> For "INDEXES" you can use the existing keys_or_index rule:
> 
> keys_or_index:
>           KEYS {}
>         | INDEX_SYM {}
>         | INDEXES {}
>         ;
> 
> it's typically used in places like that.
> There is no corresponding column_or_columns, but feel free to create it,
> if you'd like.

Ok,
Basically I'm positive about your suggestions. There are some details
I'm doubting about, but I'll dwell on them later.

Let's talk about the semantics of the extended ANALYZE first. We need to
agree upon it because now I plan moving from a 'proof of concept' for
the feature to a 'real' implementation.

For this I have to give you more context.
The engine persistent independent statistics is stored in 3 tables
of the system mysql database: table_stat, column_stat, index_stat.

The particular contents of these tables is not important for the further
discussion here, it's enough to know that:
- table_stat contains some statistics on the table in general
- column_stat contains statistics on table columns
- index_stat contains statistics on index prefixes.

The global server option --stat-tables says now whether the statistics
tables have to be taken into account (without this option the server
ignores them altogether)

There is a system variable (global/session) that controls how the
statistics table are used by queries/statements. The variable is called
optimizer_use_stat_tables.

There are 4 possible values for this variable:
- 'never'
- 'complementary'
- 'preferably'
- 'exclusively'.

The default value for the variable now is 'never'.

If we start the server with --stat-tables disabled the only meaningful
value for the variable is 'never'. Other values are ignored
Q1. Should we issue a warning with the attempt to reset this value
when --stat-tables is disabled?

Let me first tell you how the setting of this variable affects
regular queries (SELECT, UPDATE etc).

If the variable is set to:
1. case 'never':
   then no statistics from stat tables is used
2. case 'complementary'
   then the statistical value for a particular attribute is used
   from persistent tables ONLY when there is no corresponding
   statistics from other sources
3. case 'preferably'
   then the statistical value for a particular attribute is used
   from other sources ONLY when there is no corresponding
   statistics from persistent tables
4. case 'exclusively'
   the only statistics from persistent tables is used, if there is none
   on some attributes no statistics is used

The variable setting is also affects how ANALYZE works.
The current syntax of the ANALYZE command is

ANALYZE TABLE t1, ...;

An currently if @@optimizer_use_stat_tables is set to
1. 'never'
   then it works as before (e.g. in 5.5)
2. 'complementary'
   then it works as before + persistent statistics
   for table_stat and column_stat is collected
3. 'preferably'
   then it works as before + persistent statistics
   for table_stat, column_stat and index_stat is collected
4. 'exclusively'
   only persistent statistics for table_stat, column_stat
   and index_stat is collected

Q2. Is this semantics of ANALYZE is natural?

Collection of persistent statistics with ANALYZE is now engine
independent. So it's not  a cheap and fast process.
In many cases it is enough to collect/update statistics only some
columns or even for some indexes.

That's why I suggested to extend the syntax of the analyze.
Let's choose something that is close to what SergeiG suggested.

There are some questions here.
Q3. If I need statistics for all columns and some indexes of table t1
    and everything on table t2.
    can I use:
      ANALYZE TABLE t1 COLUMN[S] ALL INDEX[ES] (idx1,...), t2; ?
Q4. Do we really need the keyword PERSISTENT here?
    If so what is its semantics?
Q5. Isn't 'FOR' in the syntax of ANALYZE confusing in the sense that
    it commands to collect statistics for column_stat/index_stat,
    but not for table_stat? Maybe it's better to omit it?

I do not plan to extend the syntax of ANALYZE to collect/update
statistics only on some attributes of the columns (like collect
statistics on min/max values, but not on the number of distinct values).
Although I admit that collecting statistics on the number of distinct
values is more expensive I don'tthink such an extension of ANALYZE is
badly needed now.

I plan to add a possibility to collect statistics from random sample
records. I plan to control the process of sampling only through system
variables.

How does all above comply with persistent statistics of InnoDB from
mysql-5.6? Almost perfectly. They store their persistent statistics
in mysql.innodb_table_stat and mysql.innodb_index_stat. So we do not
conflict here. The collection and the usage of their persistent
statistics is going through their own system variables. Again - no
conflicts.

I will appreciate any comments on this post and, of course,
I would like to have some answers for the above questions from
 SergeiG.

Regards,
Igor.












   then any particular satistics from



> 
> Regards,
> Sergei



Follow ups

References