← Back to team overview

maria-developers team mailing list archive

Re: Extending storage engine API for random-row extraction for histogram collection (and others)

 

On Tue, 11 Dec 2018 at 12:43 Sergey Petrunia <sergey@xxxxxxxxxxx> wrote:

>
> On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
> > Hi!
> >
> > Here is my proposal on extending the storage engine API to provide a
> > functionality for retrieving random rows from tables (those that have
> > indexes). The storage engines for which I plan to implement this are:
> > MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.
>
> Observations:
>
> - as far as I understand, random skip scan is not possible with this API?
> (which is probably fine as we expect that sampling will only retrieve a
> small
> fraction of table rows, which means the difference between a
> forward-walking
> skip scan and genuinely random probing is negligible).
>

One could add a flag to the init function specify this, but it feels a bit
too much, hence I left it out.


> - Can the scan return the same row twice?
>
> The way I see it now, it is implementation defined within the storage
engine. I'd vote to keep it like that. I don't see a real use case for
ensuring no row duplication inside the storage engine. One can keep a
Unique Object if this is really problematic.


> - Do we want/need a concept of random "seed" which will cause the same
> rows to
> be returned on the same table?
>

I've thought about this. I think it would be useful and it could be passed
via the init function, but I couldn't find similar examples in the API and
I kept it out. Should I add it?

>
> >
> > --- a/sql/handler.h
> > +++ b/sql/handler.h
> > @@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
> >    /** Length of ref (1-8 or the clustered key length) */
> >    uint ref_length;
> >    FT_INFO *ft_handler;
> > -  enum init_stat { NONE=0, INDEX, RND };
> > +  enum init_stat { NONE=0, INDEX, RND, RANDOM };
> >    init_stat inited, pre_inited;
> > ........
> > +  virtual int ha_random_sample_init()
> __attribute__((warn_unused_result))
> > +  {
> > +    DBUG_ENTER("ha_random_sample_init");
> > +    inited= RANDOM;
> > +    DBUG_RETURN(random_sample_init());
> > +  }
> > +  virtual int ha_random_sample(uint inx,
> > +                               key_range *min_key,
> > +                               key_range *max_key)
> > +    __attribute__((warn_unused_result))
> > +  {
> > +    DBUG_ENTER("ha_random_sample");
> > +    DBUG_ASSERT(inited == RANDOM);
> > +    DBUG_RETURN(random_sample(inx, min_key, max_key));
> > +  }
> > +  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
> > +  {
> > +    DBUG_ENTER("ha_random_sample_end");
> > +    inited= NONE;
> > +    DBUG_RETURN(random_sample_end());
> > +  }
> > +
> >
> > This is the default implementation for a storage engine which does not
> > support it:
> >
> > +  virtual int random_sample_init() { return 0; } ;
> > +  virtual int random_sample(uint idx, key_range *min_key, key_range
> > *max_key)
> > +  {
> > +    return HA_ERR_WRONG_COMMAND;
> > +  }
> > +  virtual int random_sample_end() { return 0; };
> >
> > Alternative ideas: random_sample_init() takes the idx as a parameter and
> > random_sample just fetches a row from the range using the index
> previously
> > specified. The range can be left unspecified with nulls to provide a
> fetch
> > from the full table range.
> >  I don't know enough about storage engine internals to know if an index
> > declaration within the init function instead of within the "sample"
> > function is better. Maybe I am complicating it too much and a simple
> > random_sample() function is sufficient, kind of how ha_records_in_range
> > does it.
> >
> > Thoughts?
> > Vicențiu
>
> --
> BR
>  Sergei
> --
> Sergei Petrunia, Software Developer
> MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
>
>
>

References