
maria-developers team mailing list archive

Extending storage engine API for random-row extraction for histogram collection (and others)

 

Hi!

Here is my proposal for extending the storage engine API with
functionality for retrieving random rows from tables (those that have
indexes). The storage engines for which I plan to implement this are
MyISAM, Aria, and InnoDB; possibly RocksDB and TokuDB as well.

--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }
+

This is the default implementation for a storage engine that does not
support random sampling:

+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key,
+                            key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }
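
To make the intended call sequence concrete, here is a minimal standalone sketch (not part of the patch) of how a caller such as histogram collection might drive the three calls. Everything in it is a stand-in: mock_handler, counting_handler and collect_sample() are hypothetical names, key_range is an empty placeholder, the DBUG macros are replaced by a plain assert, and the value 131 for HA_ERR_WRONG_COMMAND is assumed from my_base.h.

```cpp
#include <cassert>

// Stand-ins for server types; the real ones live in the MariaDB tree.
typedef unsigned int uint;
struct key_range {};                          // placeholder, fields omitted
static const int HA_ERR_WRONG_COMMAND= 131;   // assumed value from my_base.h

// Hypothetical mock of the proposed handler additions.
class mock_handler
{
public:
  enum init_stat { NONE= 0, INDEX, RND, RANDOM };
  init_stat inited;
  mock_handler() : inited(NONE) {}
  virtual ~mock_handler() {}

  int ha_random_sample_init()
  {
    inited= RANDOM;
    return random_sample_init();
  }
  int ha_random_sample(uint inx, key_range *min_key, key_range *max_key)
  {
    assert(inited == RANDOM);                 // mirrors the DBUG_ASSERT
    return random_sample(inx, min_key, max_key);
  }
  int ha_random_sample_end()
  {
    inited= NONE;
    return random_sample_end();
  }

protected:
  // Defaults for an engine without sampling support, as in the proposal.
  virtual int random_sample_init() { return 0; }
  virtual int random_sample(uint, key_range *, key_range *)
  { return HA_ERR_WRONG_COMMAND; }
  virtual int random_sample_end() { return 0; }
};

// Hypothetical engine that "supports" sampling by only counting the calls.
class counting_handler : public mock_handler
{
public:
  uint calls;
  counting_handler() : calls(0) {}
protected:
  int random_sample(uint, key_range *, key_range *) override
  { calls++; return 0; }
};

// Hypothetical caller: tries to sample `wanted` rows over the full range of
// index `inx`. Returns the number of rows sampled, or -1 on any error.
int collect_sample(mock_handler *h, uint inx, uint wanted)
{
  if (h->ha_random_sample_init())
    return -1;
  uint got= 0;
  int err= 0;
  while (got < wanted &&
         (err= h->ha_random_sample(inx, nullptr, nullptr)) == 0)
    got++;
  int end_err= h->ha_random_sample_end();     // always pair init with end
  return (err || end_err) ? -1 : (int) got;
}
```

With the default implementation, collect_sample() fails on the first ha_random_sample() call, so a caller would have to fall back to some other way of reading rows. Note that the default random_sample_init() still returns 0, so the lack of support only shows up at the first sample call.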

Alternative idea: random_sample_init() takes the index as a parameter, and
random_sample() just fetches a row from the range using the previously
specified index. The range bounds can be left as NULL to sample from the
full table.

I don't know enough about storage engine internals to say whether
specifying the index in the init function is better than specifying it in
the "sample" function. Maybe I am overcomplicating this and a single
random_sample() function is sufficient, similar to how ha_records_in_range
does it.
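
For comparison, here is a rough standalone sketch of one possible reading of the alternative: the index is fixed in random_sample_init() and each random_sample() call only passes the range. It is a toy, not engine code: toy_engine is a hypothetical class whose only "index" is a sorted vector of ints, keys are plain ints instead of key_range, the returned "row" is just the key, and the error values are assumed from my_base.h.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

typedef unsigned int uint;
static const int HA_ERR_WRONG_COMMAND= 131;   // assumed value from my_base.h
static const int HA_ERR_END_OF_FILE= 137;     // assumed value from my_base.h

// Hypothetical toy engine with a single index (number 0) over int keys.
class toy_engine
{
  std::vector<int> keys;     // the index, kept sorted
  bool index_chosen;         // set by random_sample_init()
  std::mt19937 rng;

public:
  explicit toy_engine(std::vector<int> k, unsigned seed= 1)
    : keys(std::move(k)), index_chosen(false), rng(seed)
  { std::sort(keys.begin(), keys.end()); }

  // The index is chosen once here instead of on every sample call.
  int random_sample_init(uint idx)
  {
    if (idx != 0)
      return HA_ERR_WRONG_COMMAND;            // toy: only index 0 exists
    index_chosen= true;
    return 0;
  }

  // Fetch one uniformly chosen key from [*min_key, *max_key] (inclusive).
  // A null bound leaves that side of the range open.
  int random_sample(const int *min_key, const int *max_key, int *row)
  {
    if (!index_chosen)
      return HA_ERR_WRONG_COMMAND;
    auto first= keys.cbegin();
    auto last= keys.cend();
    if (min_key)
      first= std::lower_bound(keys.cbegin(), keys.cend(), *min_key);
    if (max_key)
      last= std::upper_bound(keys.cbegin(), keys.cend(), *max_key);
    if (first >= last)
      return HA_ERR_END_OF_FILE;              // nothing in the range
    std::size_t n= static_cast<std::size_t>(last - first);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    *row= *(first + static_cast<std::ptrdiff_t>(pick(rng)));
    return 0;
  }

  int random_sample_end()
  {
    index_chosen= false;
    return 0;
  }
};
```

In this shape the per-call arguments shrink to the range, and an engine could do any per-index setup once in init. Whether that setup cost matters for a real engine is exactly the open question above.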

Thoughts?
Vicențiu
