maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #06152
Re: Incorrect format description event skipping in queue_event()
Pavel Ivanov <pivanof@xxxxxxxxxx> writes:
>> But this must be the same problem with normal replication? Whenever the slave
>> decides to rotate the relay log, it will write a format description event
>> created by itself with no following format description created on the
>> master. So it seems this must work somehow, though I'll frankly admit I do not
>> understand the details of how this works (do you know?)
>
> Yes, I investigated this. During normal replication (when relay log is
> rotated automatically due to max_size) slave's format description is
> written at the beginning of the new realy log file, but right after
> that there's code that if description_event_for_queue->binlog_version
>>= 4 then it writes description_event_for_queue into relay log too.
> Also it ensures that the event has created = 0 and artificial_event
> set to 1. So SQL thread still gets the master's format description and
> doesn't rollback the transaction.
> When IO thread reconnects to master the first event it receives is
> Rotate. For Rotate event queue_event() executes process_io_rotate().
> Inside there if
> mi->rli.relay_log.description_event_for_queue->binlog_version >= 4 it
> forcefully replaces description_event_for_queue with new event with
> binlog_version = 3. Then it does the actual relay log rotation during
> which description_event_for_queue is not written into the new log file
> (and it shouldn't as it's not master's at this point anyway). The next
I see.
So one possible solution is to do the same at the reconnect case as what we do
in relay-log rotate initiated by slave due to size: Write out the
description_event_for_queue to the relay log with created=0 and artificial=1.
I have attached a patch for this, what do you think?
Do you have the possibility to test if this works (eg. when we get a reconnect
when the master's description event is incompatible with the slave's)?
[This code really is criminally ugly, even for replication standard. But I do
not really know how to fix it in any reasonable way :-(]
> Maybe I messed up with testing, I'll try to retest again...
FYI: I took unmodified 10.0-base, commented out those two lines, ran
rpl.rpl_gtid_reconnect, and it failed.
Thanks,
- Kristian.
=== modified file 'sql/log.cc'
--- sql/log.cc 2013-08-23 12:02:13 +0000
+++ sql/log.cc 2013-08-29 13:04:28 +0000
@@ -4806,12 +4806,23 @@ int MYSQL_BIN_LOG::new_file_impl(bool ne
}
-bool MYSQL_BIN_LOG::append(Log_event* ev)
+bool
+MYSQL_BIN_LOG::append(Log_event *ev)
{
- bool error = 0;
+ bool res;
mysql_mutex_lock(&LOCK_log);
+ res= append_no_lock(ev);
+ mysql_mutex_unlock(&LOCK_log);
+ return res;
+}
+
+
+bool MYSQL_BIN_LOG::append_no_lock(Log_event* ev)
+{
+ bool error = 0;
DBUG_ENTER("MYSQL_BIN_LOG::append");
+ mysql_mutex_assert_owner(&LOCK_log);
DBUG_ASSERT(log_file.type == SEQ_READ_APPEND);
/*
Log_event::write() is smart enough to use my_b_write() or
@@ -4829,7 +4840,6 @@ bool MYSQL_BIN_LOG::append(Log_event* ev
if (my_b_append_tell(&log_file) > max_size)
error= new_file_without_locking();
err:
- mysql_mutex_unlock(&LOCK_log);
signal_update(); // Safe as we don't call close
DBUG_RETURN(error);
}
=== modified file 'sql/log.h'
--- sql/log.h 2013-08-23 12:02:13 +0000
+++ sql/log.h 2013-08-29 13:05:53 +0000
@@ -712,6 +712,7 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
*/
bool appendv(const char* buf,uint len,...);
bool append(Log_event* ev);
+ bool append_no_lock(Log_event* ev);
void mark_xids_active(ulong cookie, uint xid_count);
void mark_xid_done(ulong cookie, bool write_checkpoint);
=== modified file 'sql/slave.cc'
--- sql/slave.cc 2013-08-23 08:16:43 +0000
+++ sql/slave.cc 2013-08-29 13:09:31 +0000
@@ -5308,6 +5308,26 @@ static int queue_event(Master_info* mi,c
if (unlikely(gtid_skip_enqueue))
{
mi->master_log_pos+= inc_pos;
+ if ((uchar)buf[EVENT_TYPE_OFFSET] == FORMAT_DESCRIPTION_EVENT &&
+ s_id == mi->master_id)
+ {
+ /*
+ If we write this master's description event in the middle of an event
+ group due to GTID reconnect, SQL thread will think that master crashed
+ in the middle of the group and roll back the first half, so we must not.
+
+ But we still have to write an artificial copy of the masters description
+ event, to override the initial slave-version description event so that
+ SQL thread has the right information for parsing the events it reads.
+ */
+ rli->relay_log.description_event_for_queue->created= 0;
+ rli->relay_log.description_event_for_queue->set_artificial_event();
+ if (rli->relay_log.append_no_lock
+ (rli->relay_log.description_event_for_queue))
+ error= ER_SLAVE_RELAY_LOG_WRITE_FAILURE;
+ else
+ rli->relay_log.harvest_bytes_written(&rli->log_space_total);
+ }
}
else
if ((s_id == global_system_variables.server_id &&
Follow ups
References