← Back to team overview

maria-developers team mailing list archive

Re: Extending storage engine API for random-row extraction for histogram collection (and others)

 

On Tue, 11 Dec 2018 at 14:33 Sergei Golubchik <serg@xxxxxxxxxxx> wrote:

> Hi, Vicențiu!
>
> On Dec 11, Vicențiu Ciorbaru wrote:
> > Hi!
> >
> > Here is my proposal on extending the storage engine API to provide a
> > functionality for retrieving random rows from tables (those that have
> > indexes). The storage engines for which I plan to implement this are:
> > MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.
> ...
> > Maybe I am complicating it too much and a simple random_sample()
> > function is sufficient, kind of how ha_records_in_range does it.
>
> Yeah. My first thought was something like index_random(). That is,
> index_init(), index_random(), index_random(), ..., index_end().
>
> index_init() does whatever it always needs to do to prepare using an
> index. and index_random() gets one random row.
>
> Not like records_in_range, that takes the index number and does
> _init/_end internally, because records_in_range is typically done only
> once or just a few times, while index_random() is more like
> index_next(), will be repeated continuously.
>
> But then I was thinking, why do you need to specify an index at all?
> Shouldn't it be just "get me a random row"? Index or whatever - that's
> engine implementation detail. For example, MyISAM with a fixed-size rows
> can just read from lseek(floor((file_size/row_size)*rand())*row_size).
>

I agree that the need for an index seems a bit much. My reasoning was that
I wanted to allow random sampling on a particular range. This could help
for example when one wants to collect histograms for a multi-distribution
dataset, to get individual distributions (if the indexed column is able to
separate them).

A more generic idea would be if one could pass some conditions for random
row retrieval to the storage engine, but it feels like this would
complicate storage engine implementation by quite a bit.

For the first iteration, after considering your input, I'd go with "init
function", "get random row", "end function", without imposing an index, but
somehow passing a (COND or similar) arg to the init function.

Sounds reasonable, or too much?
Vicențiu





Sergey had a good point about being able to repeat exactly the same
> sampling by specifying the seed. This should be solved if the engine
> will get its random numbers from the thd_rnd service.
>
> Regards,
> Sergei
>

Follow ups

References