
maria-developers team mailing list archive

Incorrect format description event skipping in queue_event()

 

Kristian,

You have the following comment in queue_event() in sql/slave.cc:

    /*
      Do not queue any format description event that we receive after a
      reconnect where we are skipping over a partial event group received
      before the reconnect.

      (If we queued such an event, and it was the first format_description
      event after master restart, the slave SQL thread would think that
      the partial event group before it in the relay log was from a
      previous master crash and should be rolled back).
    */

I don't understand which failure scenario you are describing here,
and I claim that bypassing the relay log for this event is
incorrect.

My reasoning is this: the relay log can always be rotated in the
middle of an event group. During rotation, when a new relay log file
is created, a format description event is always written at the
beginning of the file. So in this case the SQL thread sees a format
description event in the middle of an event group, and according to
my testing it does not roll back the active transaction because of
that. So when do you think it would roll back?

Now, you may say that there is no problem in skipping the format
description event because everything works. But that holds only for
now. When the IO thread reconnects, it rotates the relay log and, as
I said, writes a format description event at the beginning of the new
file. But it writes an event that it created itself, i.e. not the one
that the master has sent. And since the format description event from
the master is not written into the relay log, the SQL thread from
this point on uses the format description generated by the slave,
which may differ from the one generated by the master. This may break
replication: the SQL thread may fail to recognize events in the relay
log if the difference between the master's and the slave's format
descriptions is significant.
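
To make the concern concrete, here is a toy model of the relay log
after a reconnect. It is only a sketch and not the server code: the
types and functions (ToyEvent, relay_log_after_reconnect,
effective_header_len) are hypothetical names invented for
illustration, and the header lengths passed in are arbitrary example
values.

```cpp
#include <cstddef>
#include <vector>

// Toy model only; none of these names exist in the server source.
enum class Kind { FormatDescription, Other };
enum class Origin { Master, Slave };

struct ToyEvent {
  Kind kind;
  Origin origin;
  std::size_t common_header_len;  // only meaningful for FormatDescription
};

// Model of what lands in the new relay log file after a reconnect:
// the IO thread rotates the relay log and writes its own format
// description event, then receives the master's format description
// event, which is either queued or skipped.
std::vector<ToyEvent> relay_log_after_reconnect(std::size_t master_len,
                                                std::size_t slave_len,
                                                bool skip_master_fde) {
  std::vector<ToyEvent> log;
  log.push_back({Kind::FormatDescription, Origin::Slave, slave_len});
  if (!skip_master_fde)
    log.push_back({Kind::FormatDescription, Origin::Master, master_len});
  log.push_back({Kind::Other, Origin::Master, 0});
  return log;
}

// The reader applies the most recent format description event it has
// seen; `initial` is the description in effect before this file.
std::size_t effective_header_len(const std::vector<ToyEvent> &log,
                                 std::size_t initial) {
  std::size_t current = initial;
  for (const ToyEvent &ev : log)
    if (ev.kind == Kind::FormatDescription)
      current = ev.common_header_len;
  return current;
}
```

In this model, when the master's event is skipped the reader ends up
with the slave's header length for all following events, even though
those events were written by the master.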

Another, somewhat related question: Gtid_log_event::peek() (as well as
the Gtid_log_event constructor from const char* buf) is implemented
with the assumption that Format_description_log_event::common_header_len
is always equal to LOG_EVENT_HEADER_LEN. While that is currently true,
I believe common_header_len can be increased in future MariaDB
versions. When that happens, an old slave won't be able to replicate
from a new master. I'd think this is a bug. What do you think?
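
A minimal sketch of the effect, assuming a simplified event layout in
which an 8-byte little-endian seq_no immediately follows the common
header. The helpers make_toy_gtid_event() and peek_seq_no() are
hypothetical and are not the server's functions; the 23-byte header
in the usage below is an invented future value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Value of LOG_EVENT_HEADER_LEN in current servers.
constexpr std::size_t kLogEventHeaderLen = 19;

// Read a little-endian unsigned 64-bit integer.
static std::uint64_t read_le64(const unsigned char *p) {
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

// Hypothetical helper: build a toy GTID event consisting of a
// zero-filled common header of header_len bytes followed by seq_no.
std::vector<unsigned char> make_toy_gtid_event(std::size_t header_len,
                                               std::uint64_t seq_no) {
  std::vector<unsigned char> buf(header_len + 8, 0);
  for (int i = 0; i < 8; ++i)
    buf[header_len + i] =
        static_cast<unsigned char>((seq_no >> (8 * i)) & 0xff);
  return buf;
}

// Hypothetical helper: read seq_no at the offset given by the header
// length the caller believes is in effect.
std::uint64_t peek_seq_no(const std::vector<unsigned char> &buf,
                          std::size_t common_header_len) {
  assert(buf.size() >= common_header_len + 8);
  return read_le64(buf.data() + common_header_len);
}
```

With an event built as make_toy_gtid_event(23, 42), calling
peek_seq_no() with 23 returns 42, while calling it with
kLogEventHeaderLen reads from the wrong offset and returns a
different value. Taking the offset from the format description
currently in effect avoids that.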


Pavel

