maria-developers team mailing list archive

Thread
Date

Request for Discussion: Structure for JSON histogram objects

To: maria-developers@xxxxxxxxxxxxxxxxxxx, Sergey Petrunia <sergey@xxxxxxxxxxx>
From: Michael Okoko <michaelsokoko@xxxxxxxxx>
Date: Wed, 2 Jun 2021 14:52:27 +0100

Hi, MariaDB community!

I'm implementing the patch for Using JSON as the on-disk format for MariaDB
histograms(https://jira.mariadb.org/browse/MDEV-21130). as part of GSoC '21
and hoping to get insights and ideas from the community on how the JSON
format for histograms should be structured.

We are planning to change the columns of `mysql.column_stats` into
something like:

CREATE TABLE `column_stats` (
`db_name` varchar(64) COLLATE utf8mb3_bin NOT NULL,
`table_name` varchar(64) COLLATE utf8mb3_bin NOT NULL,
`column_name` varchar(64) COLLATE utf8mb3_bin NOT NULL,
`histogram` blob
);

The histogram column above will then be responsible for storing the JSON
document for that row.

Currently, the JSON structure we are looking at is based on the existing
columns in `column_stats` table as well MySQL's JSON histogram
implementation(
https://dev.mysql.com/doc/refman/8.0/en/optimizer-statistics.html). Here's
an example:

{
  "min_value": 0,
  "max_value": 0,
  "hist_type": "singleton",
  "bucket": [
     [1,0.3333333333333333],
     [2, 0.6666666666666666],
     [3,1]
  ],
   "nulls_ratios": 0,
   "last_updated": "2021-06-02 13:32:40.000000",
   "sampling_rate": 1,
}

- The min_value and max_values retain the meanings described at (
https://mariadb.com/kb/en/mysqlcolumn_stats-table/).
- hist_type can be one of `SINGLETON` (where the number of buckets is >=
the number of distinct values in the specified column) or `EQUI-HEIGHT`
(where the number of buckets is < the number of distinct values found in
the column).
- nulls_ratio is a value between 0.0 and 1.0 indicating the fraction of
column values that are SQL NULL values. If 0, the column contains no NULL
values.
- `last_updated` indicates when the histogram was generated, as a UTC value
in YYYY-MM-DD hh:mm:ss.uuuuuu format.
- `sampling_rate` is also a value between 0.0 and 1.0 indicating the
fraction of column values that were sampled to create the histogram. A
value of 1 means that all of the data was read (no sampling).

Follow ups

Re: Request for Discussion: Structure for JSON histogram objects
From: Sergey Petrunia, 2021-06-04