← Back to team overview

maria-developers team mailing list archive

Problem that some variables can be changed without stopping the slave IO thread

 

Monty,

This patch of yours introduced a change where before both slave threads must
be stopped to change @@replicate_events_marked_for_skip, but after only the
SQL thread is checked:

-----------------------------------------------------------------------
commit 572560f38c248d5020f0e63aeb3f8905cd568208
Author: Michael Widenius <monty@xxxxxxxxxxxx>
Date:   Wed Oct 3 01:44:54 2012 +0300

    Changed SHOW_FUNC variabels that don't return SHOW_ARRAY to SHOW_SIMPLE_FUNC.


+   Master_info_index::give_error_if_slave_running()
+
+   @return
+   TRUE        If some slave is running.  An error is printed
+   FALSE       No slave is running
+*/
+
+bool Master_info_index::give_error_if_slave_running()
+{
+  DBUG_ENTER("warn_if_slave_running");
+  mysql_mutex_assert_owner(&LOCK_active_mi);
+
+  for (uint i= 0; i< master_info_hash.records; ++i)
+  {
+    Master_info *mi;
+    mi= (Master_info *) my_hash_element(&master_info_hash, i);
+    if (mi->rli.slave_running != MYSQL_SLAVE_NOT_RUN)
+    {
+      my_error(ER_SLAVE_MUST_STOP, MYF(0), (int) mi->connection_name.length,
+               mi->connection_name.str);
+      DBUG_RETURN(TRUE);
+    }
+  }
+  DBUG_RETURN(FALSE);
+}


bool
 Sys_var_replicate_events_marked_for_skip::global_update(THD *thd, set_var *var)
...
-  /* Slave threads must be stopped to change the variable. */
+  mysql_mutex_unlock(&LOCK_global_system_variables);
   mysql_mutex_lock(&LOCK_active_mi);
-  lock_slave_threads(active_mi);
-  init_thread_mask(&thread_mask, active_mi, 0 /*not inverse*/);
-  if (thread_mask) // We refuse if any slave thread is running
-  {
-    my_error(ER_SLAVE_MUST_STOP, MYF(0));
-    result= true;
-  }
-  else
+  if (!master_info_index->give_error_if_slave_running())
-----------------------------------------------------------------------

So instead of checking both IO and SQL thread, now it checks only SQL thread
(mi->rli.slave_running). Why was this change made? Was it by mistake?

Unfortunately, I did not notice this before, so I subsequently used the same
code (give_error_if_slave_running()) for the following variables:

  gtid_slave_pos
  slave_parallel_threads
  slave_domain_parallel_threads
  replicate_events_marked_for_skip
  gtid_ignore_duplicates

This means that all of the above variables can be changed while the IO
thread is still running. This was never the intention! There are some
complex interactions between the master, the IO thread, and the SQL thread,
particularly with respect to the slave's GTID position, and parts of the
code are specifically written on the assumption that certain things cannot
change without first stopping all replication threads. (I do not have the
full overview of exactly what can break, I think that would require a
careful review of large parts of the code, but I fear that it can corrupt
not just the replication GTID position and possibly even internal server
structures).

So I am wondering what to do ... it is easy to fix
give_error_if_slave_running() to check for both threads, but this is a
change in behaviour (even if it should match documentation) that could
easily break administration scripts, eg. after a 10.0 automatic security-fix
upgrade. I also know that eg. Jean François has written about how he uses
the feature of being able to not stop the IO thread in some cases.

On the other hand, this can cause serious problems for users, as they have
no real way of knowing not to attempt to change these variables without
stopping IO thread first, when there is no error given. MDEV-9138 describes
a complex scenario where slave position gets corrupted from this.

Any suggestions?

Thanks,

 - Kristian.