← Back to team overview

maria-developers team mailing list archive

Re: eebebe8ef75: MDEV-23842 Atomic RENAME TABLE

 

Hi!

>> diff --git a/client/mysqltest.cc b/client/mysqltest.cc
>> index 350c55edee2..133d1f76839 100644
>> --- a/client/mysqltest.cc
>> +++ b/client/mysqltest.cc
>> @@ -8095,9 +8095,10 @@ void handle_error(struct st_command *command,
>>                    const char *err_sqlstate, DYNAMIC_STRING *ds)
>>  {
>>    int i;
>> -
>>    DBUG_ENTER("handle_error");
>>
>> +  var_set_int("$errno", err_errno);
>
>there was $mysql_errno already, was is not something
>you could've used?
>
>We had both errno and mysql_errno in the code.
>It think it is safer to reuse $errno as it is less used.
>Now errno is both latest sys error and lates sql error.
>
>> diff --git a/mysql-test/suite/atomic/rename_case.result b/mysql-test/suite/atomic/rename_case.result
>> new file mode 100644
>> index 00000000000..4b58c555cf2
>> --- /dev/null
>> +++ b/mysql-test/suite/atomic/rename_case.result
>> @@ -0,0 +1,52 @@
>> +create database test2;
>> +#
>> +# Testing rename error in different places
>> +#
>> +create table t1 (a int);
>> +create table T2 (b int);
>> +create table t3 (c int);
>> +create table T4 (d int);
>> +insert into t1 values(1);
>> +insert into T2 values(2);
>> +insert into t3 values(3);
>> +insert into T4 values(4);
>> +create temporary table tmp1 (a int);
>> +create temporary table tmp2 (b int);
>> +create temporary table tmp3 (c int);
>> +create temporary table tmp4 (d int);
>> +insert into tmp1 values(11);
>> +insert into tmp2 values(22);
>> +insert into tmp3 values(33);
>> +insert into tmp4 values(44);
>> +rename table t3 to T4, t1 to t5, T2 to t1, t5 to T2;
>> +ERROR 42S01: Table 'T4' already exists
>> +rename table t1 to t5, t3 to T4, T2 to t1, t5 to T2;
>> +ERROR 42S01: Table 'T4' already exists
>> +rename table t1 to t5, T2 to t1, t3 to T4, t5 to T2;
>> +ERROR 42S01: Table 'T4' already exists
>> +rename table t1 to t5, T2 to t1, t5 to T2, t3 to T4;
>> +ERROR 42S01: Table 'T4' already exists
>> +# Try failed rename using two databases
>> +rename table test.t1 to test2.t5, test.T2 to test.t1, t5 to test.T2;
>> +ERROR 42S02: Table 'test.t5' doesn't exist
>> +select t1.a+T2.b+t3.c+T4.d from t1,T2,t3,T4;
>
>better do
>
>  select t1.a,T2.b,t3.c,T4.d from t1,T2,t3,T4;
>
>same one row, but it can detect when tables are swapped.

My row also detects if tables are swapped and is safe as it always returns
the same result.

The reason it detects swapped tabels is that each table has it's own column
names and you will get an error if the tables would be swapped.

>> --- a/sql/sql_class.h
>> +++ b/sql/sql_class.h
>> @@ -2835,6 +2835,7 @@ class THD: public THD_count, /* this must be first */
>>
>>  #ifndef MYSQL_CLIENT
>>    binlog_cache_mngr *  binlog_setup_trx_data();
>> +  ulonglong binlog_xid;           /* binlog request storing of xid */
>
>"Tell binlog to store thd->query_id in the next Query_log_event"

Changed to:
  /*
    If set, tell binlog to store the value as query 'xid' in the next
    Query_log_event
  */

It is not always we store query_id and even if it would, the value of
binlog_xid is what is stored.

>> +++ b/sql/log_event_client.cc
>> @@ -1822,7 +1822,7 @@ bool Query_log_event::print_query_header(IO_CACHE* file,
>>          my_b_printf(file,
>>                      "\t%s\tthread_id=%lu\texec_time=%lu\terror_code=%d\n",
>>                      get_type_str(), (ulong) thread_id, (ulong) exec_time,
>> -                    error_code))
>> +                    error_code, (ulong) xid))
>
>huh? you didn't add "\nxid=%lu" to the format string

Ouch. Fixed



>>        goto err;
>>    }
>>
>> diff --git a/sql/debug_sync.h b/sql/debug_sync.h
>> index 3b8aa8815e1..4e3e10fcc51 100644
>> --- a/sql/debug_sync.h
>> +++ b/sql/debug_sync.h
>> @@ -53,4 +53,12 @@ static inline bool debug_sync_set_action(THD *, const char *, size_t)
>>  { return false; }
>>  #endif /* defined(ENABLED_DEBUG_SYNC) */
>>
>> +#ifndef DBUG_OFF
>> +void debug_crash_here(const char *keyword);
>> +bool debug_simulate_error(const char *keyword, uint error);
>> +#else
>> +#define debug_crash_here(A) do { } while(0)
>> +#define debug_simulate_error(A, B) 0
>> +#endif
>
>this doesn't belong to debug_sync.h or debug_sync.cc
>
>put it, hmm, may be in sql_priv.h ? or sql_class.h/sql_class.cc

Added it to sql/dbug.cc and sql/debug.h as agreed

>>  #endif /* DEBUG_SYNC_INCLUDED */
>> diff --git a/sql/debug_sync.cc b/sql/debug_sync.cc
>> index 1fbb15592a4..f523e22f6ce 100644
>> --- a/sql/debug_sync.cc
>> +++ b/sql/debug_sync.cc
>> @@ -1628,3 +1628,74 @@ bool debug_sync_set_action(THD *thd, const char *action_str, size_t len)
>>  /* prevent linker/lib warning about file without public symbols */
>>  int debug_sync_dummy;
>>  #endif /* defined(ENABLED_DEBUG_SYNC) */
>> +
>> +
>> +/**
>> +   Debug utility to do crash after a set number of executions
>> +
>> +   The user variable, either @debug_crash_counter or @debug_error_counter,
>
>this is very hackish. better to use @@debug_dbug_counter,
>something like that. also it could be global for an extra benefit
>if you'd need to crash a non-user thread.

As discussed, I don't want or need global counters.  For this to work for
atomic ddl. they have to be local thread variables.

>> diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
>> index cca05f1fcca..0cf037e3a6f 100644
>> --- a/sql/log_event_server.cc
>> +++ b/sql/log_event_server.cc
>> @@ -1294,6 +1294,15 @@ bool Query_log_event::write()
>>      int3store(start, when_sec_part);
>>      start+= 3;
>>    }
>> +
>> +  /* xid's is used with ddl_log handling */
>
>it's not really an XID, it's a query_id or the DDL statetent.
>
>XID is a combination of a 64-byte global transaction identifier and
>a 64-byte branch qualifier.
>
>It would be much less confusing not to pretend that the query_id
>is an "xid" is some shape or form. Better to rename it to, say
>ddl_query_id. Or even ddl_id, if you like it short.

That would be quite hard to do in the original commits as I would have to
fix and recompile every commit. Better to leave it to another day and
do it as a separate commit on top if we so desire.


>> diff --git a/sql/log.cc b/sql/log.cc
>> index 17c599627cb..6e0a9706ec8 100644
>> --- a/sql/log.cc
>> +++ b/sql/log.cc
>> @@ -10654,6 +10669,8 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
>>    {
>>      if (ha_recover(&xids))
>>        goto err2;
>> +    if (ddl_log_close_binlogged_events(&xids))
>
>your xids and normal Xid_log_event xids are from two different
>non-intersecting sets and have even different semantics.
>
>It'd be safer not to have them in the same hash.

I talked with you about this and you wanted to have them in the same hash.
I asked how to generated them and you told me to use thd->query_id.

Ok, I will split them, not a big deal.

>> --- a/sql/sql_view.cc
>> +++ b/sql/sql_view.cc
>> @@ -2231,16 +2234,25 @@ mysql_rename_view(THD *thd,
>>                                     (uchar*)&view_def, view_parameters))
>>      {
>>        /* restore renamed view in case of error */
>> -      rename_in_schema_file(thd, new_db->str, new_name->str, view->db.str,
>> -                            view->table_name.str);
>> +      rename_in_schema_file(thd, new_db->str, new_name->str, old_db->str,
>> +                            old_name->str);
>>        goto err;
>>      }
>> -  } else
>> +  }
>> +  else
>>      DBUG_RETURN(1);
>>
>>    /* remove cache entries */
>> -  query_cache_invalidate3(thd, view, 0);
>> -  sp_cache_invalidate();
>> +  {
>> +    char key[NAME_LEN*2+1], *ptr;
>> +    memcpy(key, old_db->str, old_db->length);
>> +    ptr= key+ old_db->length;
>> +    *ptr++= 0;
>> +    memcpy(key, old_name->str, old_name->length);
>> +    ptr= key+ old_db->length;
>> +    *ptr++= 0;
>
>there's tdc_create_key for that.

Does not work with LEX_CSTRINGs.

At some point we should create another version of td_create_key that works
with LEX_CSTRINGS. However, to do tha we have to fix a lot of functions that
takes db_name and table_name as const char*


>> +    query_cache.invalidate(thd, key, (size_t) (ptr-key), 0);
>> +  }
>>    error= FALSE;

<cut>

>> +enum enum_ddl_log_rename_table_phase {
>> +  DDL_RENAME_PHASE_TRIGGER= 0,
>> +  DDL_RENAME_PHASE_STAT,
>> +  DDL_RENAME_PHASE_TABLE,
>> +};
>> +
>> +/*
>> +  Setting ddl_log_entry.phase to this has the same effect as setting
>> +  action_type to DDL_IGNORE_LOG_ENTRY_CODE
>
>this doesn't make much sense.

It is very practical and thus makes perfect sense

> May be you can add a line or two
>about why it's needed and one cannot simply set action_type
>to DDL_IGNORE_LOG_ENTRY_CODE instead?

Because that it now how the ddl log is supposed to work.
We are never changing the action type for normal entries.
Makes it easier to check and debug things if something goes wrong.

In old code one disabled an entry by storing the maximum phase into a action
entry.  I added this as a maximum phase to make it safer (to not have to
remember where to find the maxium possible phase and to safe guard against
someone adding a new phase).

Anyway, I fixed the comment to explain this:

/*
  Setting ddl_log_entry.phase to this has the same effect as setting
  the phase to the maximum phase (..PHASE_END) for an entry.
*/


>> +*/
>> +
>> +#define DDL_LOG_FINAL_PHASE ((uchar) 0xff)
>>
>>  typedef struct st_ddl_log_entry
>>  {
>> -  const char *name;
>> -  const char *from_name;
>> -  const char *handler_name;
>> -  const char *tmp_name;
>> +  LEX_CSTRING name;
>> +  LEX_CSTRING from_name;
>> +  LEX_CSTRING handler_name;
>> +  LEX_CSTRING tmp_name;
>> +  LEX_CSTRING db;
>> +  LEX_CSTRING from_db;
>> +  LEX_CSTRING from_handler_name;
>> +  uchar uuid[MY_UUID_SIZE];            // UUID for new frm file
>> +
>> +  ulonglong xid;                       // Xid stored in the binary log
>>    uint next_entry;
>> -  uint entry_pos;
>> -  enum ddl_log_entry_code entry_type;
>> +  uint entry_pos;                      // Set by write_dll_log_entry()
>> +  /*
>> +    unique_id can be used to store a unique number to check current state.
>> +    Currently it is used to store new size of frm file or link to a ddl log
>> +    entry.
>
>is that so? I don't see it storing a file version, a new size of the frm or a
>link, I see it stores some kind of a counter to abort recovery when it was
>repeated more than DDL_LOG_MAX_RETRY times

Yes, it is so (for action entries)
For an execute entry it is re-used as a counter. I will update the
documentation for that.

>> +  */
>> +  uint32 unique_id;                    // Unique id for file version
>> +  uint16 flags;                        // Flags unique for each command
>> +  enum ddl_log_entry_code entry_type;  // Set automatically
>>    enum ddl_log_action_code action_type;
>>    /*
>>      Most actions have only one phase. REPLACE does however have two
>> @@ -95,17 +145,52 @@ typedef struct st_ddl_log_memory_entry
>>  } DDL_LOG_MEMORY_ENTRY;
>>
>>
>> +typedef struct st_ddl_log_state
>> +{
>> +  /* List of ddl log entries */
>> +  DDL_LOG_MEMORY_ENTRY *list;
>> +  /* One execute entry per list */
>> +  DDL_LOG_MEMORY_ENTRY *execute_entry;
>> +} DDL_LOG_STATE;
>
>would be nice to have a small comment, explaining what is a "state"
>and how two lists can represent it.
>
>is execute_entry a pointer to a some element from the list?

I added this comment:


/*
  State of the ddl log during execution of a DDL.

  A ddl log state has one execute entry (main entry pointing to the first
  action entry) and many 'action entries' linked in a list in the order
  they should be executed.
  One recovery the log is parsed and all execute entries will be executed.

  All entries are stored as separate blocks in the ddl recovery file.
*/

<cut>

>> diff --git a/sql/sql_rename.cc b/sql/sql_rename.cc
>> index b7aed97a8a2..2b2f32cf126 100644
>> --- a/sql/sql_rename.cc
>> +++ b/sql/sql_rename.cc
>> @@ -186,49 +172,43 @@ bool mysql_rename_tables(THD *thd, TABLE_LIST *table_list, bool silent,
>>        /* Add IF EXISTS to binary log */
>>        thd->variables.option_bits|= OPTION_IF_EXISTS;
>>      }
>> +
>> +    debug_crash_here("ddl_log_rename_before_binlog");
>> +    /*
>> +      Store xid in ddl log and binary log so that we can check on ddl recovery
>> +      if the item is in the binary log (and thus the operation was complete
>> +    */
>> +    thd->binlog_xid= thd->query_id;
>> +    ddl_log_update_xid(&ddl_log_state, thd->binlog_xid);
>>      binlog_error= write_bin_log(thd, TRUE, thd->query(), thd->query_length());
>
>I'd prefer to avoid blowing up THD with arbitrary flags
>that are only needed rarely in one specific place in the code
>to pass some argument few stack frames down.

The other way would be to pass it as an argument to all binlog entries,
but that would be even worse as the usage of binlog_xid is very deep in
the call stack

>Perhaps you can do instead something like
>
>  bool Query_log_event::write()
>  {
>   ...
>   /* xid's is used with ddl_log handling */
>   if (ddl_log_entry_incomplete())

There is nothing of the ddl log stored in the thd, so this would not work.

>   {
>     *start++= Q_XID;
>     int8store(start, thd->query_id);
>     start+= 8;
>   }
>
>or at least you could replace ulonglong binlog_xid
>with a one bit state, like thd->server_status & SERVER_STATUS_IN_DDL
>or with some other bitmap of flags.

As binlog_xid contains the number that should be stored (no, it is not
always query_id) there is no way to do this as a bit.

<cut>

>> -/*
>> -  reverse table list
>> -
>> -  SYNOPSIS
>> -    reverse_table_list()
>> -    table_list pointer to table _list
>> -
>> -  RETURN
>> -    pointer to new (reversed) list
>> -*/
>> -static TABLE_LIST *reverse_table_list(TABLE_LIST *table_list)
>> -{
>> -  TABLE_LIST *prev= 0;
>> -
>> -  while (table_list)
>> -  {
>> -    TABLE_LIST *next= table_list->next_local;
>> -    table_list->next_local= prev;
>> -    prev= table_list;
>> -    table_list= next;
>> -  }
>> -  return (prev);
>> -}
>> -
>> -
>>  static bool
>> -do_rename_temporary(THD *thd, TABLE_LIST *ren_table, TABLE_LIST *new_table,
>> -                    bool skip_error)
>> +do_rename_temporary(THD *thd, TABLE_LIST *ren_table, TABLE_LIST *new_table)
>
>I wonder why we never got any warnings about it

Warnings about what?

Fixing a failed rename worked correctly before it server did not go down.
Above I was just reusing the recovery code to do the renames back properly.




>>  {
>>    LEX_CSTRING *new_alias;
>>    DBUG_ENTER("do_rename_temporary");
>> diff --git a/sql/ddl_log.cc b/sql/ddl_log.cc
>> index bad786b3799..777dfdddbc7 100644
>> --- a/sql/ddl_log.cc

<cut>
>> +  DDl recovery after a crash works the following way:
>
>DDL or ddl

Fixed


cut>

>>
>>  st_global_ddl_log global_ddl_log;
>
>static?

Already done in later commits.


<cut>

>> +  Sync the ddl log file.
>> +
>> +  @return Operation status
>> +    @retval FALSE  Success
>> +    @retval TRUE   Error
>> +*/
>> +
>> +static bool ddl_log_sync_file()
>> +{
>> +  DBUG_ENTER("ddl_log_sync_file");
>> +  DBUG_RETURN(mysql_file_sync(global_ddl_log.file_id, MYF(MY_WME)));
>> +}
>
>I don't quite understand that, are there cases when you want to
>sync the ddl log and you *don't know* whether LOCK_gdl is locked or not?
>
>because if you know there should be no lock, you'd better use
>mysql_mutex_assert_not_owner here.

Bad orginal code. See top of my tree how this was cleaned up over time
as I was working on the code.

>
>> +
>> +/* Same as above, but ensure we have the LOCK_gdb locked */
>
>that's an interesting Freudian slip
>
>> +
>> +static bool ddl_log_sync_no_lock()
>> +{
>> +  DBUG_ENTER("ddl_log_sync_no_lock");
>> +
>> +  mysql_mutex_assert_owner(&LOCK_gdl);
>> +  DBUG_RETURN(ddl_log_sync_file());
>> +}
>...
>> +/*
>> +  Store and read strings in ddl log buffers
>> +
>> +  Format is:
>> +    2 byte: length (not counting end \0)
>> +    X byte: string value of length 'length'
>> +    1 byte: \0
>
>a bit redundant, don't you think? Why did you do it?

What is redundant?
If you see the usage of the above, you will see the benfits of this format.

<cut>

>>  static void set_global_from_ddl_log_entry(const DDL_LOG_ENTRY *ddl_log_entry)
>
>it'd be good to have some comment explaining why you need this
>function. Couldn't you write directly from the ddl_log_entry?
>or use a local buffer? why to serialize into a buffer inside global_ddl_log?

Ask Michael Ronström. This is still his logic.
Anyway, it works good enough so I have not tried to change it.
Having a global buffer helps a little bit with not having to do almost
no mallocs when running DDL.s

<cut>

>> +  pos= store_string(pos, end, &ddl_log_entry->tmp_name);
>
>you have DBUG_ASSERT(0) that all those strings won't overflow io_size.
>How can you guarantee that?

The buffer can never overflow, that code guarantees that.
What could in theory happen is the an entry could not fit into
a block, in which case recovery would not work.
However this is very unlikely/impossible as we don't store any big
strings in it.

All 'big' strings (like query) are stored separately.
We are maximum storing 2 db's and 2 table names + internal tmp file
(small string). This would be (64*4)*4  (assuming 4 byte character set),
which is 1024 and the default block is 4K.

<cut>

>> +/*
>> +  Build a filename for a table, trigger file or .frm
>> +  Delete also any temporary file suffixed with ~
>
>Where do you get these *~ files from? What creates them?

Triggers, creation of frm etc.
see sql_create_definition_file().

<cut>


>>    DBUG_PRINT("ddl_log",
>> -             ("write type %c next %u name '%s' from_name '%s' handler '%s'"
>> -              " tmp_name '%s'",
>> +             ("type: %c  next: %u  handler: %s  "
>> +              "to_name: '%s.%s' from_name: '%s.%s' "
>> +              "tmp_name: '%s'",
>>               (char) global_ddl_log.file_entry_buf[DDL_LOG_ACTION_TYPE_POS],
>>               ddl_log_entry->next_entry,
>> -             (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS],
>> -             (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
>> -                                                    + FN_REFLEN],
>> -             (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
>> -                                                    + (2*FN_REFLEN)],
>> -             (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
>> -                                                    + (3*FN_REFLEN)]));
>> +              get_string(&pos, end).str,   // Handler
>> +              get_string(&pos, end).str,   // to db.table
>> +              get_string(&pos, end).str,
>> +              get_string(&pos, end).str,   // From db.table
>> +              get_string(&pos, end).str,
>> +              get_string(&pos, end).str)); // Tmp name
>
>This is not a critical bug, because it's in a DBUG_PRINT,
>but it's still a bug. C++ (and C) standard both say that
>the order of evaluation of function arguments is *unspecified*.

This is already fixed in a later commit.

<cut>

>>  bool ddl_log_write_execute_entry(uint first_entry,
>> -                                 bool complete,
>>                                   DDL_LOG_MEMORY_ENTRY **active_entry)
>>  {
>> -  bool write_header= FALSE;
>> -  char *file_entry_buf= (char*)global_ddl_log.file_entry_buf;
>> +  uchar *file_entry_buf= global_ddl_log.file_entry_buf;
>> +  bool got_free_entry= 0;
>>    DBUG_ENTER("ddl_log_write_execute_entry");
>>
>>    mysql_mutex_assert_owner(&LOCK_gdl);
>> -  if (init_ddl_log())
>> -  {
>> -    DBUG_RETURN(TRUE);
>> -  }
>> -  if (!complete)
>> -  {
>>      /*
>>        We haven't synched the log entries yet, we synch them now before
>>        writing the execute entry. If complete is true we haven't written
>
>there's no 'complete' anymore

Fixed

>
>>        any log entries before, we are only here to write the execute
>>        entry to indicate it is done.
>>      */
>>      (void) ddl_log_sync_no_lock();
>
>Why do you need to sync previous log entries? As far as I understand,
>they're useless without DDL_LOG_EXECUTE_CODE entry, so why would it matter
>whether they're on disk or not after a crash?

Because we have to ensure that when exceute entry is written (and can thus
be executed any time), all ddl action entries must be also on disk.

Thus we have to write and sync ddl action entries before writing the
execute entry.

<cut>

>> +bool ddl_log_close_binlogged_events(HASH *xids)
>> +{
>> +  uint i;
>> +  DDL_LOG_ENTRY ddl_log_entry;
>> +  DBUG_ENTER("ddl_log_close_binlogged_events");
>> +
>> +  if (global_ddl_log.num_entries == 0 || xids->records == 0)
>> +    DBUG_RETURN(0);
>> +
>> +  mysql_mutex_lock(&LOCK_gdl);
>> +  for (i= 1; i <= global_ddl_log.num_entries; i++)
>> +  {
>> +    if (read_ddl_log_entry(i, &ddl_log_entry))
>> +      break;                                    // Read error. Ignore
>
>it's not really "ignore", it's more like "stop reading"

Updated comment.

<cut>

>>      if (ddl_log_entry.entry_type == DDL_LOG_EXECUTE_CODE)
>>      {
>> +      /* purecov: begin tested */
>
>really? you tested with purify?

Didn't you know that mtr when run with gcov, takes the above into account?
Elena has verified that all code in atomic create table was executed,
except for the code that was marked with purify.
The effect of the above that mtr found that "all lines where tested".

<cut>

>> +      /* purecov: end tested */
>>        if (ddl_log_execute_entry_no_lock(thd, ddl_log_entry.next_entry))
>>        {
>> -        /* Real unpleasant scenario but we continue anyways.  */
>> +        /* Real unpleasant scenario but we have to continue anyway  */
>
>why can this happen? do you have an example?
>besides hardware failure, like a disk cannot be read

InnoDB gives an error that it cannot drop a table.
<cut>

>> +  /*
>> +    Create a new ddl_log to get rid of old stuff and ensure that header matches
>> +    the current source version
>> +   */
>
>because here you delete/overwrite the old ddl log, it'd make sense
>to log all details of every entry that you failed to execute but decided
>to continue anyway. Now you log only the entry number on failure, which is
>rather useless, because it's a number in a ddl log that you overwrite.

As we have agreed, I now create a backup file before starting recovery.

<cut>

Regards,
Monty


References