maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #11565
Re: Extending storage engine API for random-row extraction for histogram collection (and others)
Hi, Vicențiu!
On Dec 11, Vicențiu Ciorbaru wrote:
> Hi!
>
> Here is my proposal on extending the storage engine API to provide a
> functionality for retrieving random rows from tables (those that have
> indexes). The storage engines for which I plan to implement this are:
> MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.
...
> Maybe I am complicating it too much and a simple random_sample()
> function is sufficient, kind of how ha_records_in_range does it.
Yeah. My first thought was something like index_random(). That is,
index_init(), index_random(), index_random(), ..., index_end().
index_init() does whatever it always needs to do to prepare using an
index. and index_random() gets one random row.
Not like records_in_range, that takes the index number and does
_init/_end internally, because records_in_range is typically done only
once or just a few times, while index_random() is more like
index_next(), will be repeated continuously.
But then I was thinking, why do you need to specify an index at all?
Shouldn't it be just "get me a random row"? Index or whatever - that's
engine implementation detail. For example, MyISAM with a fixed-size rows
can just read from lseek(floor((file_size/row_size)*rand())*row_size).
Sergey had a good point about being able to repeat exactly the same
sampling by specifying the seed. This should be solved if the engine
will get its random numbers from the thd_rnd service.
Regards,
Sergei
Follow ups
References