← Back to team overview

maria-developers team mailing list archive

Re: Incorrect format description event skipping in queue_event()


> So suppose now we have a two-statement transaction:
>     BEGIN; UPDATE t1 SET a=a+1; UPDATE t2 SET b=b-1; COMMIT
> and that slave IO thread reconnects in the middle (between the two updates).
> First we receive the BEGIN and the first UPDATE and write it to relay log 1:
>     Format_Description (slave)
>     Format_Description (master)
>     Gtid_list
>     Binlog_checkpoint
>     GTID BEGIN 0-1-1
>     Query UPDATE t1 SET a=a+1
> Then we disconnect and reconnect and write the remaining events to
> relay log 2:
>     Format_Description (slave)
>     # master format description received but skipped
>     # GTID 0-1-1 skipped
>     # Query UPDATE t1 skipped
>     UPDATE t2 SET b=b-1
> and the SQL thread sees an intact transaction. (I am not sure I got 100% right
> exactly which non-transaction events are written, but this is the overall
> idea).
> And so, the reason it is ok to skip the format description event from the
> master in relay log 2 is that this _only_ happens when we know for sure that
> the same master format description event was already written to an earlier
> relay log and seen by the slave. Or put another way, if we had not reconnected
> to the master at this point, we would not have received the master's format
> description event at this point in the first place, but things would still
> have worked without it obviously.
> Makes sense? Or did I miss some case that could make this cause problems?

Yes, this causes problems because SQL thread sees Format_Description
(slave) from relay log 2 and replaces its description_event_for_exec
with it. So from this moment it continues to parse relay log events
using slave's format description when all events actually created (and
understood by IO thread) using master's format description.

>> When IO thread is reconnecting it rotates relay log and as I said it
>> writes format description event at the beginning of the new file. But
>> it writes an event that it created itself, i.e. not the one that
>> master have sent. And as format description event from master is not
>> written into relay log SQL thread from this point on starts to use
>> format description generated by slave which may be different from the
>> one generated by master. It may lead to a broken replication and SQL
> But this must be the same problem with normal replication? Whenever the slave
> decides to rotate the relay log, it will write a format description event
> created by itself with no following format description created on the
> master. So it seems this must work somehow, though I'll frankly admit I do not
> understand the details of how this works (do you know?)

Yes, I investigated this. During normal replication (when relay log is
rotated automatically due to max_size) slave's format description is
written at the beginning of the new realy log file, but right after
that there's code that if description_event_for_queue->binlog_version
>= 4 then it writes description_event_for_queue into relay log too.
Also it ensures that the event has created = 0 and artificial_event
set to 1. So SQL thread still gets the master's format description and
doesn't rollback the transaction.
When IO thread reconnects to master the first event it receives is
Rotate. For Rotate event queue_event() executes process_io_rotate().
Inside there if
mi->rli.relay_log.description_event_for_queue->binlog_version >= 4 it
forcefully replaces description_event_for_queue with new event with
binlog_version = 3. Then it does the actual relay log rotation during
which description_event_for_queue is not written into the new log file
(and it shouldn't as it's not master's at this point anyway). The next
event IO thread receives is Format_Description. If it's written to
relay log at this point then SQL thread will proceed using master's
format description. And I believe I tested the situation when IO
thread reconnects in the middle of transaction and writes format
description from master (I've commented those lines you pointed to)
and I didn't see that transaction rolled back, everything was fine.
Maybe I messed up with testing, I'll try to retest again...

>> Another somewhat related question: Gtid_log_event::peek() (as well as
>> Gtid_log_event constructor from const char* buf) is implemented with
>> assumption that Format_description_log_event::common_header_len is
>> always equal to LOG_EVENT_HEADER_LEN. While currently it's true I
> Agree, it looks like a bug.
> Do you have the possibility to help with this? It is a bit hard for me to test
> such a fix as I do not have an easy way to generate binlogs with different
> header lengths, but I think perhaps that your team has such capability?

I can try, although first quick look showed that it can be tricky:
peek() is used not only in IO thread, but during scanning binlogs too.
I didn't look yet if code has correct description_event available at
this point.


Follow ups