← Back to team overview

maria-developers team mailing list archive

Re: Extending storage engine API for random-row extraction for histogram collection (and others)

 

Hi, Vicențiu!

On Dec 11, Vicențiu Ciorbaru wrote:
> Hi!
> 
> Here is my proposal on extending the storage engine API to provide a
> functionality for retrieving random rows from tables (those that have
> indexes). The storage engines for which I plan to implement this are:
> MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.
...
> Maybe I am complicating it too much and a simple random_sample()
> function is sufficient, kind of how ha_records_in_range does it.

Yeah. My first thought was something like index_random(). That is,
index_init(), index_random(), index_random(), ..., index_end().

index_init() does whatever it always needs to do to prepare using an
index. and index_random() gets one random row.

Not like records_in_range, that takes the index number and does
_init/_end internally, because records_in_range is typically done only
once or just a few times, while index_random() is more like
index_next(), will be repeated continuously.

But then I was thinking, why do you need to specify an index at all?
Shouldn't it be just "get me a random row"? Index or whatever - that's
engine implementation detail. For example, MyISAM with a fixed-size rows
can just read from lseek(floor((file_size/row_size)*rand())*row_size).

Sergey had a good point about being able to repeat exactly the same
sampling by specifying the seed. This should be solved if the engine
will get its random numbers from the thd_rnd service.

Regards,
Sergei


Follow ups

References