Re: MDEV-3792: review of the handler part of the cassandra engine

Hi Sergei,

On Fri, Dec 14, 2012 at 12:26:24PM +0100, Sergei Golubchik wrote:
> 
> The handler is quite ok. I've only had a few, mostly cosmetic, comments.

Addressed most of them, but some remain. Can we discuss them on IRC? 
> 
> > === modified file 'cmake/build_configurations/mysql_release.cmake'
> > --- cmake/build_configurations/mysql_release.cmake	2012-06-06 12:15:29 +0000
> > +++ cmake/build_configurations/mysql_release.cmake	2012-11-12 15:11:23 +0000
> > @@ -46,6 +46,8 @@ SET(FEATURE_SET_large   5)
> >  SET(FEATURE_SET_xlarge  6)
> >  SET(FEATURE_SET_community 7)
> >  
> > +SET(WITH_CASSANDRA_STORAGE_ENGINE  ON)
> > +
> 
> This probably should be reverted before pushing into main, right?

Yes. I suppose this will be replaced by some other setting that builds
Cassandra only in certain configurations.

> >  IF(FEATURE_SET)
> >    STRING(TOLOWER ${FEATURE_SET} feature_set)
> >    SET(num ${FEATURE_SET_${feature_set}})
> > 
> > === modified file 'sql/sql_join_cache.cc'
> > --- sql/sql_join_cache.cc	2012-03-24 17:21:22 +0000
> > +++ sql/sql_join_cache.cc	2012-11-12 15:11:23 +0000
> > @@ -4543,7 +4546,7 @@ bool JOIN_CACHE_BKAH::prepare_look_for_m
> >  {
> >    last_matching_rec_ref_ptr= next_matching_rec_ref_ptr= 0;
> >    if (no_association &&
> > -      (curr_matching_chain= get_matching_chain_by_join_key()))
> > +      !(curr_matching_chain= get_matching_chain_by_join_key())) //psergey: added '!'
> 
> Is that something that should be pushed into the main branch or
> it's a temporary hack for cassandra, that should be reverted?

It should be pushed into the main branch: it is a fix for a typo bug in the
BKA code, not a temporary hack. The bug can only be observed when BKA is used
with a "no-association" storage engine, such as Cassandra SE or Falcon. The
'psergey:' comment should probably be removed, though.

> >      return 1;
> >    last_matching_rec_ref_ptr= get_next_rec_ref(curr_matching_chain);
> >    return 0;
> > 
> > === added file 'storage/cassandra/CMakeLists.txt'
> > --- storage/cassandra/CMakeLists.txt	1970-01-01 00:00:00 +0000
> > +++ storage/cassandra/CMakeLists.txt	2012-11-12 15:11:23 +0000
> > @@ -0,0 +1,25 @@
> > +
> > +SET(cassandra_sources 
> > +     ha_cassandra.cc 
> > +     ha_cassandra.h 
> > +     cassandra_se.h
> > +     cassandra_se.cc
> > +     gen-cpp/Cassandra.cpp
> > +     gen-cpp/cassandra_types.h 
> > +     gen-cpp/cassandra_types.cpp
> > +     gen-cpp/cassandra_constants.h 
> > +     gen-cpp/cassandra_constants.cpp
> > +     gen-cpp/Cassandra.h)
> > +
> > +#INCLUDE_DIRECTORIES(BEFORE ${Boost_INCLUDE_DIRS})
> > +
> > +#INCLUDE_DIRECTORIES(AFTER /usr/local/include/thrift)
> > +INCLUDE_DIRECTORIES(AFTER /home/buildbot/build/thrift-inst/include/thrift/) 
> 
> this needs to be fixed before pushing into the main tree
> I suppose, you'll need to detect here if thrift is available
> and disable the engine otherwise
> 
Agreed. We need to figure out how to make CMake check for Thrift, and disable
the engine when it is not available.

> > +
> > +# 
> > +STRING(REPLACE "-fno-exceptions" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
> > +STRING(REPLACE "-fno-implicit-templates" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
> > +#LINK_DIRECTORIES(/home/psergey/cassandra/thrift/lib)
> > +
> > +MYSQL_ADD_PLUGIN(cassandra ${cassandra_sources} STORAGE_ENGINE LINK_LIBRARIES thrift)
> > +# was: STORAGE_ENGINE MANDATORY 
> > 
...
> > === added file 'storage/cassandra/ha_cassandra.h'
> > --- storage/cassandra/ha_cassandra.h	1970-01-01 00:00:00 +0000
> > +++ storage/cassandra/ha_cassandra.h	2012-11-12 15:11:23 +0000
...
> > +  ulonglong table_flags() const
> > +  {
> > +    /*
> > +      HA_BINLOG_STMT_CAPABLE
> > +        We are saying that this engine is just statement capable to have
> > +        an engine that can only handle statement-based logging. This is
> > +        used in testing.
> > +      HA_REC_NOT_IN_SEQ
> > +        If we don't set it, filesort crashes, because it assumes rowids are
> > +        1..8 byte numbers
> > +    */
> > +    return HA_BINLOG_STMT_CAPABLE |
> > +           HA_REC_NOT_IN_SEQ;
> 
> HA_NO_TRANSACTIONS is unset. Is this engine transactional?
No. I've added the flag.

> HA_PARTIAL_COLUMN_READ is unset. Do you always work with all columns, ignoring read_set and write_set?
Currently, yes.

> HA_TABLE_SCAN_ON_INDEX is unset. Do you store the data MyISAM-style, index and data in separate files?
> HA_FAST_KEY_READ is unset. is reading keys in order faster in cassandra than random reads?
TODO: I'm not sure about the above two flags; let's discuss them.

> HA_REQUIRE_PRIMARY_KEY is unset. but a primary key is required.
Added the flag.

> HA_PRIMARY_KEY_IN_READ_INDEX is unset. but as far as I see, primary key columns are always returned (and needed for position())
Added the flag.

> HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is unset, but a primary key is required for position()
Added the flag.

> HA_NO_AUTO_INCREMENT is unset. is auto-increment supported?
Added the flag.
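
For reference, with the flags added in this round, table_flags() will look
roughly like this (a sketch; comments abbreviated):

  ulonglong table_flags() const
  {
    return HA_BINLOG_STMT_CAPABLE |  /* statement-based logging only */
           HA_REC_NOT_IN_SEQ |       /* rowids are not 1..8 byte numbers */
           HA_NO_TRANSACTIONS |      /* the engine is not transactional */
           HA_REQUIRE_PRIMARY_KEY |  /* a primary key is mandatory */
           HA_PRIMARY_KEY_IN_READ_INDEX |
           HA_PRIMARY_KEY_REQUIRED_FOR_POSITION |
           HA_NO_AUTO_INCREMENT;     /* auto-increment is not supported */
  }
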
> 
> > +
> > +  }
...
> > +  uint max_supported_key_length()    const { return 16*1024; /* just to return something*/ }
> > +
> > +  int index_init(uint idx, bool sorted);
> > +
> > +  int index_read_map(uchar * buf, const uchar * key,
> > +                     key_part_map keypart_map,
> > +                     enum ha_rkey_function find_flag);
> > +
> > +  /** @brief
> > +    Called in test_quick_select to determine if indexes should be used.
> > +  */
> > +  virtual double scan_time() { return (double) (stats.records+stats.deleted) / 20.0+10; }
> 
> does that make any sense? (I understand that it's copy-paste)
> stats.deleted, for example, is never set.
> 
I've added "stats.deleted= 0" to ha_cassandra::info(). I'd prefer to keep this
default impementation for the function.
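
That is, the change amounts to (a sketch; the rest of info() is elided):

  int ha_cassandra::info(uint flag)
  {
    ...
    /* stats.deleted is never maintained by this engine; zero it so the
       default scan_time() formula above doesn't read garbage */
    stats.deleted= 0;
    ...
  }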

> > +
> > +  /** @brief
> > +    This method will never be called if you do not implement indexes.
> > +  */
> > +  virtual double read_time(uint, uint, ha_rows rows)
> > +  { return (double) rows /  20.0+1; }
> 
> same question
> 
...

> > === added file 'storage/cassandra/ha_cassandra.cc'
> > --- storage/cassandra/ha_cassandra.cc	1970-01-01 00:00:00 +0000
> > +++ storage/cassandra/ha_cassandra.cc	2012-11-12 15:11:23 +0000
...
> > +
> > +static int cassandra_init_func(void *p)
> > +{
> > +  DBUG_ENTER("cassandra_init_func");
> > +
> > +#ifdef HAVE_PSI_INTERFACE
> > +  init_cassandra_psi_keys();
> > +#endif
> > +
> > +  cassandra_hton= (handlerton *)p;
> > +  mysql_mutex_init(ex_key_mutex_example, &cassandra_mutex, MY_MUTEX_INIT_FAST);
> > +  (void) my_hash_init(&cassandra_open_tables,system_charset_info,32,0,0,
> > +                      (my_hash_get_key) cassandra_get_key,0,0);
> > +
> > +  cassandra_hton->state=   SHOW_OPTION_YES;
> > +  cassandra_hton->create=  cassandra_create_handler;
> > +  /*
> > +    Don't specify HTON_CAN_RECREATE in flags. re-create is used by TRUNCATE
> > +    TABLE to create an *empty* table from scratch. Cassandra table won't be
> > +    emptied if re-created.
> > +  */
> 
> HTON_ALTER_NOT_SUPPORTED is not set. ALTER works?

Yes. A Cassandra table is just a view into the column family. We support
instant ALTER TABLE, as long as the new DDL makes sense.

> HTON_TEMPORARY_NOT_SUPPORTED is not set. CREATE TEMPORARY TABLE works?
> HTON_NO_PARTITION is not set. Partitioning works?
> 
> > +  cassandra_hton->flags=   0;
> > +  cassandra_hton->table_options= cassandra_table_option_list;
> > +  cassandra_hton->field_options= cassandra_field_option_list;
> > +
> > +  mysql_mutex_init(0 /* no instrumentation */,
> 
> why no instrumentation ?
> 
> > +                   &cassandra_default_host_lock, MY_MUTEX_INIT_FAST);
> > +
> > +  DBUG_RETURN(0);
> > +}
> > +
> > +
...

> > +
> > +ColumnDataConverter *map_field_to_validator(Field *field, const char *validator_name)
> > +{
> > +  ColumnDataConverter *res= NULL;
> > +
> > +  switch(field->type()) {
> > +    case MYSQL_TYPE_TINY:
> > +      if (!strcmp(validator_name, validator_boolean))
> 
> why do you strcmp here, while you only check selected characters in
> get_cassandra_type()?
> you could've called get_cassandra_type() for validator_name instead
> and compared enum values instead of strings.

This function (written by me) was here before get_cassandra_type() was
written by Sanja.

I don't particularly like the idea of checking selected characters: what if
the next version of Cassandra introduces a new datatype? The code in this
function will refuse to work with it and produce a meaningful error message.

The code in Sanja's function will silently map it to some datatype, and the
conversion will not make sense.

map_field_to_validator() is not exactly a hotspot function. Establishing a
connection to Cassandra currently requires:
- making a TCP connection
- requesting the Thrift-packed DDL for the column family from Cassandra
- receiving that DDL via Thrift (which populates the data structures)
- checking the needed columns

Clearly, map_field_to_validator() is not the bottleneck. If connection setup
cost becomes an issue, we need a connection pool. Why optimize
map_field_to_validator() to the point of unreadability?
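
To illustrate the failure mode I want to keep (a sketch; the caller code and
the error text here are made up for illustration):

  /* An unknown validator falls through every strcmp() above, so
     map_field_to_validator() returns NULL and the caller can name the
     offending column, instead of silently mis-converting the data: */
  if (!(converter= map_field_to_validator(field, validator_name)))
  {
    my_error(ER_INTERNAL_ERROR, MYF(0),
             "Failed to map column to Cassandra's data type");
    return true;
  }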

> > +      {
> > +        res= new TinyintDataConverter;
> > +        break;
> > +      }
> > +      /* fall through: */
> > +    case MYSQL_TYPE_SHORT:
> > +    case MYSQL_TYPE_LONGLONG:
> > +    {
> > +      bool is_counter= false;
> > +      if (!strcmp(validator_name, validator_bigint) ||
> > +          !strcmp(validator_name, validator_timestamp) ||
> > +          (is_counter= !strcmp(validator_name, validator_counter)))
> > +        res= new BigintDataConverter(!is_counter);
> > +      break;
> > +    }
> > +    case MYSQL_TYPE_FLOAT:
> > +      if (!strcmp(validator_name, validator_float))
> > +        res= new FloatDataConverter;
> > +      break;
> > +
> > +    case MYSQL_TYPE_DOUBLE:
> > +      if (!strcmp(validator_name, validator_double))
> > +        res= new DoubleDataConverter;
> > +      break;
> > +
> > +    case MYSQL_TYPE_TIMESTAMP:
> > +      if (!strcmp(validator_name, validator_timestamp))
> > +        res= new TimestampDataConverter;
> > +      break;
> > +
> > +    case MYSQL_TYPE_STRING: // these are space padded CHAR(n) strings.
> > +      if (!strcmp(validator_name, validator_uuid) &&
> > +          field->real_type() == MYSQL_TYPE_STRING &&
> > +          field->field_length == 36)
> > +      {
> > +        // UUID maps to CHAR(36), its text representation
> > +        res= new UuidDataConverter;
> > +        break;
> > +      }
> > +      /* fall through: */
> > +    case MYSQL_TYPE_VAR_STRING:
> > +    case MYSQL_TYPE_VARCHAR:
> > +    {
> > +      /*
> > +        Cassandra's "varint" type is a binary-encoded arbitrary-length
> > +        big-endian number.
> > +        - It can be mapped to VARBINARY(N), with sufficiently big N.
> > +        - If the value does not fit into N bytes, it is an error. We should not
> > +          truncate it, because that is just as good as returning garbage.
> > +        - varint should not be mapped to BINARY(N), because BINARY(N) values
> > +          are zero-padded, which will work as multiplying the value by
> > +          2^k for some value of k.
> > +      */
> > +      if (field->type() == MYSQL_TYPE_VARCHAR &&
> > +          field->binary() &&
> > +          (!strcmp(validator_name, validator_varint) ||
> > +           !strcmp(validator_name, validator_decimal)))
> > +      {
> > +        res= new StringCopyConverter(field->field_length);
> > +        break;
> > +      }
> > +
> > +      if (!strcmp(validator_name, validator_blob) ||
> > +          !strcmp(validator_name, validator_ascii) ||
> > +          !strcmp(validator_name, validator_text))
> > +      {
> > +        res= new StringCopyConverter((size_t)-1);
> > +      }
> > +      break;
> > +    }
> > +    case MYSQL_TYPE_LONG:
> > +      if (!strcmp(validator_name, validator_int))
> > +        res= new Int32DataConverter;
> > +      break;
> > +
> > +    default:;
> > +  }
> > +  return res;
> > +}
> > +
> > +
...

> > +
> > +int ha_cassandra::index_init(uint idx, bool sorted)
> > +{
> > +  int ires;
> > +  if (!se && (ires= connect_and_check_options(table)))
> > +    return ires;
> > +  return 0;
> > +}
> > +
> > +void store_key_image_to_rec(Field *field, uchar *ptr, uint len);
> > +
> > +int ha_cassandra::index_read_map(uchar *buf, const uchar *key,
> > +                                 key_part_map keypart_map,
> > +                                 enum ha_rkey_function find_flag)
> > +{
> > +  int rc= 0;
> > +  DBUG_ENTER("ha_cassandra::index_read_map");
> > +
> > +  if (find_flag != HA_READ_KEY_EXACT)
> > +    DBUG_RETURN(HA_ERR_WRONG_COMMAND);
> 
> did you verify that the server expects HA_ERR_WRONG_COMMAND from
> this method and correctly handles it?
> or, perhaps, find_flag is always HA_READ_KEY_EXACT here,
> because of your index_flags() ?

The latter is true. I don't expect this function to ever be invoked with
find_flag != HA_READ_KEY_EXACT.
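
For reference, that is guaranteed by index_flags() (a sketch; advertising no
HA_READ_NEXT/HA_READ_PREV/HA_READ_RANGE/HA_READ_ORDER capabilities leaves
exact primary key lookups as the only access method):

  ulong index_flags(uint idx, uint part, bool all_parts) const
  {
    return 0; /* exact key lookups only */
  }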

> > +
> > +  uint key_len= calculate_key_len(table, active_index, key, keypart_map);
> > +  store_key_image_to_rec(table->field[0], (uchar*)key, key_len);
> > +
> > +  char *cass_key;
> > +  int cass_key_len;
> > +  my_bitmap_map *old_map;
> > +
> > +  old_map= dbug_tmp_use_all_columns(table, table->read_set);
> > +
> > +  if (rowkey_converter->mariadb_to_cassandra(&cass_key, &cass_key_len))
> > +  {
> > +    /* We get here when making lookups like uuid_column='not-an-uuid' */
> > +    dbug_tmp_restore_column_map(table->read_set, old_map);
> > +    DBUG_RETURN(HA_ERR_KEY_NOT_FOUND);
> > +  }
> > +
> > +  dbug_tmp_restore_column_map(table->read_set, old_map);
> > +
> > +  bool found;
> > +  if (se->get_slice(cass_key, cass_key_len, &found))
> > +  {
> > +    my_error(ER_INTERNAL_ERROR, MYF(0), se->error_str());
> 
> generally it's better not to use my_error() directly,
> but only return an error code, and return your se->error_str()
> from ha_cassandra::get_error_message()
> 
> > +    rc= HA_ERR_INTERNAL_ERROR;
> > +  }
> > +
> > +  /* TODO: what if we're not reading all columns?? */
> > +  if (!found)
> > +    rc= HA_ERR_KEY_NOT_FOUND;
> > +  else
> > +    rc= read_cassandra_columns(false);
> > +
> > +  DBUG_RETURN(rc);
> > +}
> > +
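
Regarding the my_error() suggestion above: the suggested override would look
roughly like this (a sketch, not in the current code):

  bool ha_cassandra::get_error_message(int error, String *buf)
  {
    if (error == HA_ERR_INTERNAL_ERROR)
    {
      buf->append(se->error_str());
      return false; /* not a temporary error */
    }
    return handler::get_error_message(error, buf);
  }

index_read_map() would then just set rc= HA_ERR_INTERNAL_ERROR without
calling my_error() itself.
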
...

> > +int ha_cassandra::rnd_pos(uchar *buf, uchar *pos)
> > +{
> > +  int rc;
> > +  DBUG_ENTER("ha_cassandra::rnd_pos");
> > +
> > +  int save_active_index= active_index;
> > +  active_index= 0; /* The primary key */
> > +  rc= index_read_map(buf, pos, key_part_map(1), HA_READ_KEY_EXACT);
> > +
> > +  active_index= save_active_index;
> 
> a bit nicer would be to implement index_read_map_idx
> and call it from here and from index_read_map
> 
> > +
> > +  DBUG_RETURN(rc);
> > +}
> > +
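
For reference, that refactoring could look roughly like this (a sketch; I'm
using the actual handler method name, index_read_idx_map):

  int ha_cassandra::index_read_idx_map(uchar *buf, uint index,
                                       const uchar *key,
                                       key_part_map keypart_map,
                                       enum ha_rkey_function find_flag)
  {
    int save_active_index= active_index;
    active_index= index;
    int rc= index_read_map(buf, key, keypart_map, find_flag);
    active_index= save_active_index;
    return rc;
  }

  int ha_cassandra::rnd_pos(uchar *buf, uchar *pos)
  {
    /* The position is the primary key value: */
    return index_read_idx_map(buf, 0, pos, key_part_map(1),
                              HA_READ_KEY_EXACT);
  }
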
> ...
> > +bool ha_cassandra::mrr_start_read()
> > +{
> > +  uint key_len;
> > +
> > +  my_bitmap_map *old_map;
> > +  old_map= dbug_tmp_use_all_columns(table, table->read_set);
> > +
> > +  se->new_lookup_keys();
> > +
> > +  while (!(source_exhausted= mrr_funcs.next(mrr_iter, &mrr_cur_range)))
> > +  {
> > +    char *cass_key;
> > +    int cass_key_len;
> > +
> > +    DBUG_ASSERT(mrr_cur_range.range_flag & EQ_RANGE);
> > +
> > +    uchar *key= (uchar*)mrr_cur_range.start_key.key;
> > +    key_len= mrr_cur_range.start_key.length;
> > +    //key_len= calculate_key_len(table, active_index, key, keypart_map); // NEED THIS??
> > +    store_key_image_to_rec(table->field[0], (uchar*)key, key_len);
> > +
> > +    rowkey_converter->mariadb_to_cassandra(&cass_key, &cass_key_len);
> > +
> > +    // Primitive buffer control
> > +    if (se->add_lookup_key(cass_key, cass_key_len) >
> > +        THDVAR(table->in_use, multiget_batch_size))
> > +      break;
> 
> I'd suggest to add a status variable to count
> how many times the buffer was refilled
> 
Agreed. I'd prefer to do this in the next version.
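
Something like this, presumably (a sketch; the counter name is hypothetical):

  /* At the top of mrr_start_read(), which fills and sends one batch: */
  cassandra_counters.multiget_rounds++;

The value would then be exposed through the same status-variable machinery
as the existing counters (see show_cassandra_vars() below).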

> > +  }
> > +
> > +  dbug_tmp_restore_column_map(table->read_set, old_map);
> > +
> > +  return se->multiget_slice();
> > +}
...
> > +
> > +
> > +/* The following function was copied from ha_blackhole::store_lock: */
> > +THR_LOCK_DATA **ha_cassandra::store_lock(THD *thd,
> > +                                         THR_LOCK_DATA **to,
> > +                                         enum thr_lock_type lock_type)
> > +{
> > +  DBUG_ENTER("ha_cassandra::store_lock");
> > +  if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK)
> > +  {
> > +    /*
> > +      Here is where we get into the guts of a row level lock.
> > +      If TL_UNLOCK is set
> > +      If we are not doing a LOCK TABLE or DISCARD/IMPORT
> > +      TABLESPACE, then allow multiple writers
> > +    */
> > +
> > +    if ((lock_type >= TL_WRITE_CONCURRENT_INSERT &&
> > +         lock_type <= TL_WRITE) && !thd_in_lock_tables(thd)
> > +        && !thd_tablespace_op(thd))
> 
> 1. tablespace op in cassandra? really? too much copy-paste is confusing.
> (here and in the comments too)
> 2. did you test LOCK TABLES?

What is a meaningful usage scenario for the Cassandra SE with LOCK TABLES?
It is a distributed engine; do table locks matter at all?


> > +      lock_type = TL_WRITE_ALLOW_WRITE;
> 
> that makes all changes to cassandra immediately visible
> to concurrently running threads. okay, if that's intentional,
> but it needs to be documented, as it is not the SQL semantics.
>
It is intentional; Cassandra's storage semantics are different from SQL's.

> > +
> > +    /*
> > +      In queries of type INSERT INTO t1 SELECT ... FROM t2 ...
> > +      MySQL would use the lock TL_READ_NO_INSERT on t2, and that
> > +      would conflict with TL_WRITE_ALLOW_WRITE, blocking all inserts
> > +      to t2. Convert the lock to a normal read lock to allow
> > +      concurrent inserts to t2.
> > +    */
> > +
> > +    if (lock_type == TL_READ_NO_INSERT && !thd_in_lock_tables(thd))
> > +      lock_type = TL_READ;
> 
> this also breaks SBR for INSERT ... SELECT. same as above - okay if
> intentional, but should be thoroughly documented.
>
I am not certain about this. Let's discuss it.
> > +
> > +    lock.type= lock_type;
> > +  }
> > +  *to++= &lock;
> > +  DBUG_RETURN(to);
> > +}
> > +
> > +
> > +int ha_cassandra::external_lock(THD *thd, int lock_type)
> > +{
> > +  DBUG_ENTER("ha_cassandra::external_lock");
> > +  DBUG_RETURN(0);
> > +}
> > +
> > +int ha_cassandra::delete_table(const char *name)
> > +{
> > +  DBUG_ENTER("ha_cassandra::delete_table");
> > +  /*
> > +    Cassandra table is just a view. Dropping it doesn't affect the underlying
> > +    column family.
> > +  */
> > +  DBUG_RETURN(0);
> > +}
> 
> more dummies...
> 
handler::delete_table needs this implementation. I've removed all the other
dummies, as well as the "dummy implementations start/end" comments.
...

> > +static int show_cassandra_vars(THD *thd, SHOW_VAR *var, char *buff)
> > +{
> > +  cassandra_counters_copy= cassandra_counters;
> > +
> > +  var->type= SHOW_ARRAY;
> > +  var->value= (char *) &cassandra_status_variables;
> 
> what's the point? If you don't do anything in this function,
> you can just as easily list all status variables below
> in the func_status[] array.
> 
> > +  return 0;
> > +}
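
For the record, the function is not a no-op: the assignment above snapshots
the live cassandra_counters into cassandra_counters_copy, and the entries of
cassandra_status_variables point into that copy, so a single SHOW STATUS
reads one consistent set of values. A sketch of the assumed layout (the
member names are made up):

  static struct st_mysql_show_var cassandra_status_variables[]=
  {
    {"row_inserts",
     (char*) &cassandra_counters_copy.row_inserts, SHOW_LONGLONG},
    /* ... more counters ... */
    {0, 0, SHOW_UNDEF}
  };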
> > +
> > +
> > +struct st_mysql_storage_engine cassandra_storage_engine=
> > +{ MYSQL_HANDLERTON_INTERFACE_VERSION };
> > +
> > +static struct st_mysql_show_var func_status[]=
> > +{
> > +  {"Cassandra",  (char *)show_cassandra_vars, SHOW_FUNC},
> > +  {0,0,SHOW_UNDEF}
> > +};
> > +
> > +maria_declare_plugin(cassandra)
> > +{
> > +  MYSQL_STORAGE_ENGINE_PLUGIN,
> > +  &cassandra_storage_engine,
> > +  "CASSANDRA",
> > +  "Monty Program Ab",
> > +  "Cassandra storage engine",
> > +  PLUGIN_LICENSE_GPL,
> > +  cassandra_init_func,                            /* Plugin Init */
> > +  cassandra_done_func,                            /* Plugin Deinit */
> > +  0x0001,                                       /* version number (0.1) */
> > +  func_status,                                  /* status variables */
> > +  cassandra_system_variables,                     /* system variables */
> > +  "0.1",                                        /* string version */
> > +  MariaDB_PLUGIN_MATURITY_EXPERIMENTAL          /* maturity */
> > +}
> > +maria_declare_plugin_end;

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog

