← Back to team overview

maria-developers team mailing list archive

Re: Request for Discussion: Structure for JSON histogram objects

 

Just to clarify: we are looking at two options currently.
Please find their description below.

== Current mysql.column_stats ==

Currently, MariaDB defines mysql.column_stats as:

CREATE TABLE column_stats (
  db_name       varchar(64) COLLATE utf8mb3_bin NOT NULL,
  table_name    varchar(64) COLLATE utf8mb3_bin NOT NULL,
  column_name   varchar(64) COLLATE utf8mb3_bin NOT NULL,
  min_value     varbinary(255) DEFAULT NULL,
  max_value     varbinary(255) DEFAULT NULL,
  nulls_ratio   decimal(12,4) DEFAULT NULL,
  avg_length    decimal(12,4) DEFAULT NULL,
  avg_frequency decimal(12,4) DEFAULT NULL,
  hist_size     tinyint(3) unsigned DEFAULT NULL,
  hist_type     enum('SINGLE_PREC_HB','DOUBLE_PREC_HB') 
                COLLATE utf8mb3_bin DEFAULT NULL,
  histogram     varbinary(255) DEFAULT NULL,
  PRIMARY KEY (db_name,table_name,column_name)
);

It seems there is some redundancy here: the histogram column is of variable
length, but there is also hist_size column which also stores length of the 
histogram (?).

== Variant #1: Add JSON support to the current scheme

* Keep all the same fields, but add another histogram type: JSON (more on the
* name choice in the next email)
* Change the definition of the 'histogram' column so it is not limited to 255
bytes.

CREATE TABLE column_stats (
  ...
  -- the same columns as before 
  ...
  hist_type     enum('SINGLE_PREC_HB','DOUBLE_PREC_HB', 'JSON')
                COLLATE utf8mb3_bin DEFAULT NULL,
  histogram     blob DEFAULT NULL,
  PRIMARY KEY (db_name,table_name,column_name)
);

Advantages:
* The definition is close to the current definition
* The data from the older version can be kept as-is: no migration or conversion
  is necessary
* The format is backward-compatible: old server binary can read the records
stored in the new format (we'll still need to verify this)

Disadvantages:
- depending on value of hist_type, histogram column holds either JSON or binary
  data, which is not a good data modeling practice.

- JSON histogram holds real bucket bounds. The column's min_value and max_value 
  will be stored twice: First, in min_value and max_value columns, and then 
  in the first/last bound in the histogram.
  On the other hand, separate min_value and max_value columns allow storing
  min/max statistics without collecting the histogram.

== Variant #2: Change to JSON-only ==

New mysql.column_stats will look like this:

CREATE TABLE column_stats (
  db_name       varchar(64) COLLATE utf8mb3_bin NOT NULL,
  table_name    varchar(64) COLLATE utf8mb3_bin NOT NULL,
  column_name   varchar(64) COLLATE utf8mb3_bin NOT NULL,
   
  -- perhaps some columns like avg_length, avg_frequency

  histogram JSON, -- All histograms are in JSON
  PRIMARY KEY (db_name,table_name,column_name)
);

The histogram will be always stored in JSON. Old histograms will 
be converted to the new format.

Advantages
- straightforward data modeling.
- no "baggage" (Do we need old-format histograms, long term?)

Disadvantages
- This is not compatible with the old data format.
- If we chose to support legacy histograms, we'll need to convert them into
  JSON (This should be fairly easy to with DECODE_HISTOGRAM function is useful for that)

Any thoughts on this anyone?

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net




Follow ups

References