maria-developers team mailing list archive

Thread
Date

Re: Extending storage engine API for random-row extraction for histogram collection (and others)

To: Vicențiu Ciorbaru <vicentiu@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Tue, 11 Dec 2018 13:33:06 +0100
Cc: Oleksandr Byelkin <sanja@xxxxxxxxxxx>, Michael Widenius <michael.widenius@xxxxxxxxx>, Igor Babaev <igor@xxxxxxxxxxx>, maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CACzaYTN_tyeEAxXy64BudXO-w6UNG3j_zcKt3bbkofPZwvBw_A@mail.gmail.com>
User-agent: Mutt/1.7.2 (2016-11-26)

Hi, Vicențiu!

On Dec 11, Vicențiu Ciorbaru wrote:
> Hi!
> 
> Here is my proposal on extending the storage engine API to provide a
> functionality for retrieving random rows from tables (those that have
> indexes). The storage engines for which I plan to implement this are:
> MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.
...
> Maybe I am complicating it too much and a simple random_sample()
> function is sufficient, kind of how ha_records_in_range does it.

Yeah. My first thought was something like index_random(). That is,
index_init(), index_random(), index_random(), ..., index_end().

index_init() does whatever it always needs to do to prepare using an
index. and index_random() gets one random row.

Not like records_in_range, that takes the index number and does
_init/_end internally, because records_in_range is typically done only
once or just a few times, while index_random() is more like
index_next(), will be repeated continuously.

But then I was thinking, why do you need to specify an index at all?
Shouldn't it be just "get me a random row"? Index or whatever - that's
engine implementation detail. For example, MyISAM with a fixed-size rows
can just read from lseek(floor((file_size/row_size)*rand())*row_size).

Sergey had a good point about being able to repeat exactly the same
sampling by specifying the seed. This should be solved if the engine
will get its random numbers from the thd_rnd service.

Regards,
Sergei

Follow ups

Re: Extending storage engine API for random-row extraction for histogram collection (and others)
From: Vicențiu Ciorbaru, 2018-12-11

References

Extending storage engine API for random-row extraction for histogram collection (and others)
From: Vicențiu Ciorbaru, 2018-12-10