← Back to team overview

maria-developers team mailing list archive

Re: Extending storage engine API for random-row extraction for histogram collection (and others)

 

On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
> Hi!
> 
> Here is my proposal on extending the storage engine API to provide a
> functionality for retrieving random rows from tables (those that have
> indexes). The storage engines for which I plan to implement this are:
> MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.

Observations:

- as far as I understand, random skip scan is not possible with this API?
(which is probably fine as we expect that sampling will only retrieve a small
fraction of table rows, which means the difference between a forward-walking
skip scan and genuinely random probing is negligible).

- Can the scan return the same row twice?

- Do we want/need a concept of random "seed" which will cause the same rows to
be returned on the same table?

> 
> --- a/sql/handler.h
> +++ b/sql/handler.h
> @@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
>    /** Length of ref (1-8 or the clustered key length) */
>    uint ref_length;
>    FT_INFO *ft_handler;
> -  enum init_stat { NONE=0, INDEX, RND };
> +  enum init_stat { NONE=0, INDEX, RND, RANDOM };
>    init_stat inited, pre_inited;
> ........
> +  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
> +  {
> +    DBUG_ENTER("ha_random_sample_init");
> +    inited= RANDOM;
> +    DBUG_RETURN(random_sample_init());
> +  }
> +  virtual int ha_random_sample(uint inx,
> +                               key_range *min_key,
> +                               key_range *max_key)
> +    __attribute__((warn_unused_result))
> +  {
> +    DBUG_ENTER("ha_random_sample");
> +    DBUG_ASSERT(inited == RANDOM);
> +    DBUG_RETURN(random_sample(inx, min_key, max_key));
> +  }
> +  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
> +  {
> +    DBUG_ENTER("ha_random_sample_end");
> +    inited= NONE;
> +    DBUG_RETURN(random_sample_end());
> +  }
> +
> 
> This is the default implementation for a storage engine which does not
> support it:
> 
> +  virtual int random_sample_init() { return 0; } ;
> +  virtual int random_sample(uint idx, key_range *min_key, key_range
> *max_key)
> +  {
> +    return HA_ERR_WRONG_COMMAND;
> +  }
> +  virtual int random_sample_end() { return 0; };
> 
> Alternative ideas: random_sample_init() takes the idx as a parameter and
> random_sample just fetches a row from the range using the index previously
> specified. The range can be left unspecified with nulls to provide a fetch
> from the full table range.
>  I don't know enough about storage engine internals to know if an index
> declaration within the init function instead of within the "sample"
> function is better. Maybe I am complicating it too much and a simple
> random_sample() function is sufficient, kind of how ha_records_in_range
> does it.
> 
> Thoughts?
> Vicențiu

-- 
BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog




Follow ups

References