← Back to team overview

maria-developers team mailing list archive

Re: Review request MDEV-6274 Collation usage statistics (for feedback plugin)

 

Hi, Alexander!

On Jul 24, Alexander Barkov wrote:
> On 07/03/2014 01:47 PM, Sergei Golubchik wrote:
> > Hi, Alexander!
> >
> > I don't particularly like the idea of the COLLATIONPROPERTY function,
> > The standard way of getting object metadata is to query information
> > schema. Introducing "property" function has long term implications. What
> > would be the way to get sysvar data?
> >
> >    select VARIABLE_VALUE from I_S.GLOBAL_VARIABLES where VARIABLE_NAME=xxx
> >
> > or
> >
> >    select VARIABLEPROPERTY('xxx', 'Value')
> >
> > and the help text?
> >
> >    select VARIABLEPROPERTY('xxx', 'Help')
> >
> > Storage engine of a table:
> >
> >    select TABLEPROPERTY('xxx', 'Engine')
> >
> > See - you're introducing a new way to access object metadata (and, may
> > be, data too - "property" is kind of ambigous). And what might look like
> > making sense now in 10.0 for collations, isn't necessarily reasonable
> > long term.
> >
> > As far as collations are concerned, I would rather see this data
> > in the information_schema. In the new table or in the I_S.COLLATIONS -
> > I don't know what's better, see also MDEV-6138.
> 
> I'm thinking of adding many collation properties in the future
> (e.g. tailoring, case/accent/kana sensitivity, Unicode version,
> comparison strength, sort strength, various Unicode Collation Algorithm 
> options).
> 
> So I didn't want to flood the standard table 
> INFORMATION_SCHEMA.COLLATIONS with all these properties.

Ah, yes, you're right. INFORMATION_SCHEMA.COLLATIONS is the standard
one. Indeed, it's better to add a new table then.

> It should be fine to add a new table. But I have some concerns:
> 
> 1. I don't like mixing usage statistics and static properties into the
> same table.
> 
> Should we add two new tables with about these structures:
> 
> CREATE TABLE COLLATION_STATISTICS
> (
>    ID BIGINT NOT NULL,
>    USE_COUNT BIGINT NOT NULL
> );
> 
> CREATE TABLE COLLATION_PROPERTIES
> (
>    ID BIGINT NOT NULL,
>    TAILORING TEXT CHARACTER SET UTF8MB4
>    -- things like Unicode version, sensitivity to be added here later
> );

I'm fine with one table, actually.

> 2. To search by name, one will need a join by ID,
> e.g. using a NATURAL JOIN:
> 
> SELECT TAILORING FROM
>    INFORMATION_SCHEMA.COLLATIONS
>    NATURAL JOIN
>    INFORMATION_SCHEMA.COLLATION_PROPERTIES
> WHERE
>    COLLATION_NAME='utf8_polish_ci';
> 
> A little bit cumbersome. But should probably work.

Yes, ok.

> 3. Performance should be better with a function.
> With an I_S table, we'll have to populate the table every time
> with all rows, but in fact the caller in many cases may want
> only one row (e.g. searching by name or ID).
> But performance is rather not too much important here.

There's condition pushdown for I_S tables. It's only used for
table-related I_S tables at the moment, but there's no reason why it
cannot be used for collations, if this will become an issue.

> >> === modified file 'mysys/charset.c'
> >> --- mysys/charset.c	2013-11-08 22:20:07 +0000
> >> +++ mysys/charset.c	2014-05-29 05:20:11 +0000
...
> >> +my_bool my_collation_is_known_id(uint id)
> >> +{
> >> +  return id > 0 && id < array_elements(all_charsets) && all_charsets[id] ?
> >> +         TRUE : FALSE;
> >> +}
> >
> > Hmm. I see this function is mostly used in asserts, which is great.
> > But you also use it in COLLATIONPROPERTY function, and I'm not sure it's
> > correct. One can specify an id which is not loaded yet.
> > COLLATIONPROPERTY should not fail because of that.
> 
> It should not fail.
...
> So a not loaded collation should not erroneously be treated as unknown,
> if you mean this.

Yes.

> >> +/*
> >> +  Collation use statistics functions do not lock
> >> +  counters to avoid mutex contention. This can lose
> >> +  some counter increments with high thread concurrency.
> >> +  But this should be Ok, as we don't need exact numbers.
> >> +*/
> >> +static inline void my_collation_statistics_inc_use_count(uint id)
> >> +{
> >> +  DBUG_ASSERT(my_collation_is_known_id(id));
> >> +  my_collation_statistics[id].use_count++;
> >> +}
> >> +
> >> +
> >> +ulonglong my_collation_statistics_get_use_count(uint id)
> >> +{
> >> +  DBUG_ASSERT(my_collation_is_known_id(id));
> >> +  return my_collation_statistics[id].use_count;
> >
> > Alternatively, you can define it to return 0 for unknown ids.
> > Or -1. Would be a bit easier to use in fill_collation_statistics()
> 
> Not sure. I think my way is more future prove.
> First, you check if a collation is known and thus can be queried.
> Second, on success you query for its particular property.

sure, that's up to you.
that was just a suggestion, based on how you use this function
in fill_collation_statistics().

> >> +                                 "Collation_usecount %s",
> >
> > Let's rather call it "Collation %s used", to be in line with plugin
> > rows in the feedback table. Having simply "%s used" would've been
> > even more consistent, but we don't want to have plugins and
> > collations in the same namespace. I wish I had plugins prefixed with
> > the "Plugin", but it's too late to change.
> 
> Collation names have different lengths, so putting
> collation names as the last part of the value name
> would bring more readable results:
> 
> +---------------------------------------+----+
> | Collation used utf8_unicode_ci        | 10 |
> | Collation used utf8_bin               | 20 |
> | Collation used utf8mb4_unicode_520_ci | 30 |
> +---------------------------------------+----+

Fine.
Readability is not the goal, the result is not really for a human
consumption - the script will read and parse it.

But I don't see why we need to make it intentionally unreadable :)
So, "Collation used %s" is good

Regards,
Sergei



References