maria-developers team mailing list archive

Thread
Date

On the issue of Seconds_behind_master and Parallel Replication

To: MariaDB Developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Thu, 15 Oct 2015 08:16:30 +0200
Cc: Monty Widenius <monty@xxxxxxxxxxx>, Stephane Varoqui <stephane@xxxxxxxxxx>
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (gnu/linux)

It was brought to my attention an issue with parallel replication and the
Seconds_Behind_Master field of SHOW SLAVE STATUS. I have a possible patch
for this, but I wanted to discuss it on the list, as it changes semantics
compared to the non-parallel case.

Each binlog event contains a timestamp (**) of when the event was created on
the master. Whenever the slave SQL thread reads an event from the relay log,
it updates the value of Seconds_Behind_Master to the difference between the
slave's current time and the event's timestamp.

Now in parallel replication, the SQL thread can read a large number of
events from the relay log and queue them in-memory for the worker threads.
So a small value of Seconds_Behind_Master means only that recent events have
been queued - it might still be a long time before the worker threads have
had time to actually execute all the queued events. Apparently the problem
is (justified) user confusion about this queuing delay not being reflected
in Seconds_Behind_Master.

The same problem actually exists in the non-parallel case. In case of a
large transaction, the Seconds_Behind_Master can be small even though there
is still a large amount of execution time remaining for the transaction to
complete on the slave. However, in the non-parallel case, at most one
transaction can be involved. In the parallel case, the problem is amplified
by the potential of thousands of queued transactions awaiting execution.

So how to solve it? Attached is a patch that implements one possible
solution: the Seconds_Behind_Master is only updated after a transaction
commits, with the timestamp of the commit events. This seems more intuitive
anyway. But it does introduce a semantic difference between the non-parallel
and parallel behaviour for Seconds_Behind_Master. The value will in general
be larger on a parallel slave than on a non-parallel slave, for the same
actual slave lag.

Monty suggested changing the behaviour also for non-parallel mode - letting
Seconds_Behind_Master reflect only events actually committed, not just read
from the relay log. This would introduce an incompatible behaviour for
Seconds_Behind_Master, but could perhaps be done for 10.1, if desired. Doing
it in stable 10.0 would be more drastic.

So any opinions on this?

 - Should Seconds_Behind_Master be changed as per above in parallel
   replication (from 10.0 on)?

 - If not, any suggestion for another semantics for Seconds_Behind_Master in
   parallel replication?

 - If so, should the change to Seconds_Behind_Master also be done in the
   non-parallel case in 10.1? What about 10.0?

 - Any comments on the patch?

Here is a test case that shows the problem. It simulates a slave catching up
with a bunch of INSERT(SLEEP(1)). Without my patch, Seconds_Behind_Master=2
just after starting the parallel slave, while the non-parallel slave shows
Seconds_Behind_Master=9. With my patch, the parallel slave shows
Seconds_Behind_Master=11.

-----------------------------------------------------------------------
--source include/have_innodb.inc
--source include/have_binlog_format_statement.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc

# Test various aspects of parallel replication.

--connection server_1
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CALL mtr.add_suppression("Unsafe statement written to the binary log using statement format");
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
--save_master_pos

--connection server_2
--sync_with_master
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=1;

--connection server_1
--disable_warnings
INSERT INTO t1 VALUES (1, SLEEP(1));
INSERT INTO t1 VALUES (2, SLEEP(1));
INSERT INTO t1 VALUES (3, SLEEP(1));
INSERT INTO t1 VALUES (4, SLEEP(1));
INSERT INTO t1 VALUES (5, SLEEP(1));
INSERT INTO t1 VALUES (6, SLEEP(1));
INSERT INTO t1 VALUES (7, SLEEP(1));
INSERT INTO t1 VALUES (8, SLEEP(1));
INSERT INTO t1 VALUES (9, SLEEP(1));
INSERT INTO t1 VALUES (10, SLEEP(1));
--disable_warnings
--save_master_pos

--connection server_2
--source include/start_slave.inc
SELECT SLEEP(1);
--let $status_items= Seconds_Behind_Master
--source include/show_slave_status.inc
--sync_with_master
--let $status_items= Seconds_Behind_Master
--source include/show_slave_status.inc


# Clean up.
--connection server_2
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=DEFAULT;
--source include/start_slave.inc

--connection server_1
DROP TABLE t1;

--source include/rpl_end.inc
-----------------------------------------------------------------------

Thanks,

 - Kristian.

** The full detail is a bit more complex. Some events actually contains the
   time when the corresponding query _started_ executing, and another value
   which is the duration of execution. These are added by the slave to get
   the timestamp for the creation of the event (at the end of query
   exection). Also, the slave adjusts the timestamp with any time difference
   between the master and slave clock (eg. time zone for example).

diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index cc5da77..920442c 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -44,6 +44,9 @@ rpt_handle_event(rpl_parallel_thread::queued_event *qev,
   rgi->event_relay_log_pos= qev->event_relay_log_pos;
   rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos;
   strcpy(rgi->future_event_master_log_name, qev->future_event_master_log_name);
+  if (!(ev->is_artificial_event() || ev->is_relay_log_event() ||
+        (ev->when == 0)))
+    rgi->last_master_timestamp= ev->when + (time_t)ev->exec_time;
   mysql_mutex_lock(&rli->data_lock);
   /* Mutex will be released in apply_event_and_update_pos(). */
   err= apply_event_and_update_pos(ev, thd, rgi, rpt);
diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc
index 8a2a55f..3b2fea9 100644
--- a/sql/rpl_rli.cc
+++ b/sql/rpl_rli.cc
@@ -1001,6 +1001,18 @@ void Relay_log_info::inc_group_relay_log_pos(ulonglong log_pos,
       else if (group_master_log_pos < log_pos)
         group_master_log_pos= log_pos;
     }
+
+    /*
+      In the parallel case, we only update the Seconds_Behind_Master at the
+      end of a transaction. In the non-parallel case, the value is updated as
+      soon as an event is read from the relay log; however this would be too
+      confusing for the user, seeing the slave reported as up-to-date when
+      potentially thousands of events are still queued up for worker threads
+      waiting for execution.
+    */
+    if (rgi->last_master_timestamp &&
+        rgi->last_master_timestamp > last_master_timestamp)
+      last_master_timestamp= rgi->last_master_timestamp;
   }
   else
   {
@@ -1630,6 +1642,7 @@ rpl_group_info::reinit(Relay_log_info *rli)
   row_stmt_start_timestamp= 0;
   long_find_row_note_printed= false;
   did_mark_start_commit= false;
+  last_master_timestamp = 0;
   gtid_ignore_duplicate_state= GTID_DUPLICATE_NULL;
   commit_orderer.reinit();
 }
diff --git a/sql/rpl_rli.h b/sql/rpl_rli.h
index 2d92f38..1e8bb66 100644
--- a/sql/rpl_rli.h
+++ b/sql/rpl_rli.h
@@ -669,6 +669,13 @@ struct rpl_group_info
   char gtid_info_buf[5+10+1+10+1+20+1];
 
   /*
+    The timestamp, from the master, of the commit event.
+    Used to do delayed update of rli->last_master_timestamp, for getting
+    reasonable values out of Seconds_Behind_Master in SHOW SLAVE STATUS.
+  */
+  time_t last_master_timestamp;
+
+  /*
     Information to be able to re-try an event group in case of a deadlock or
     other temporary error.
   */
diff --git a/sql/slave.cc b/sql/slave.cc
index 0243272..89399b4 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -3500,8 +3500,13 @@ static int exec_relay_log_event(THD* thd, Relay_log_info* rli,
       If it is an artificial event, or a relay log event (IO thread generated
       event) or ev->when is set to 0, we don't update the
       last_master_timestamp.
+
+      In parallel replication, we might queue a large number of events, and
+      the user might be surprised to see a claim that the slave is up to date
+      long before those queued events are actually executed.
      */
-    if (!(ev->is_artificial_event() || ev->is_relay_log_event() || (ev->when == 0)))
+    if (opt_slave_parallel_threads == 0 &&
+        !(ev->is_artificial_event() || ev->is_relay_log_event() || (ev->when == 0)))
     {
       rli->last_master_timestamp= ev->when + (time_t) ev->exec_time;
       DBUG_ASSERT(rli->last_master_timestamp >= 0);

Follow ups

Re: On the issue of Seconds_behind_master and Parallel Replication
From: Kristian Nielsen, 2015-10-22
Re: On the issue of Seconds_behind_master and Parallel Replication
From: Jean-François Gagné, 2015-10-16
Re: On the issue of Seconds_behind_master and Parallel Replication
From: Kristian Nielsen, 2015-10-15
Re: On the issue of Seconds_behind_master and Parallel Replication
From: Ian Gilfillan, 2015-10-15
Re: On the issue of Seconds_behind_master and Parallel Replication
From: Eric Herman, 2015-10-15