
maria-developers team mailing list archive

Re: Incorrect format description event skipping in queue_event()


Pavel Ivanov <pivanof@xxxxxxxxxx> writes:

> You have the following comment in the queue_event() in sql/slave.cc:
>     /*
>       Do not queue any format description event that we receive after a
>       reconnect where we are skipping over a partial event group received
>       before the reconnect.
>       (If we queued such an event, and it was the first format_description
>       event after master restart, the slave SQL thread would think that
>       the partial event group before it in the relay log was from a
>       previous master crash and should be rolled back).
>     */
> I don't understand which failure scenario you are talking about here
> and I claim that this bypassing of queuing into relay log is
> incorrect.

It is this code, in Format_description_log_event::do_apply_event():

    /*
      As a transaction NEVER spans on 2 or more binlogs:
      if we have an active transaction at this point, the master died
      while writing the transaction to the binary log, i.e. while
      flushing the binlog cache to the binlog. XA guarantees that master has
      rolled back. So we roll back.
      Note: this event could be sent by the master to inform us of the
      format of its binlog; in other words maybe it is not at its
      original place when it comes to us; we'll know this by checking
      log_pos ("artificial" events have log_pos == 0).
    */
  if (!is_artificial_event() && created && thd->transaction.all.ha_list)
  {
    /* This is not an error (XA is safe), just an information */
    rli->report(INFORMATION_LEVEL, 0,
                "Rolling back unfinished transaction (no COMMIT "
                "or ROLLBACK in relay log). A probable cause is that "
                "the master died while writing the transaction to "
                "its binary log, thus rolled back too.");
    const_cast<Relay_log_info*>(rli)->cleanup_context(thd, 1);
  }

Try it: comment out these lines in slave.cc

    if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
        gtid_skip_enqueue= true;

and run the rpl.rpl_gtid_reconnect test. It will fail because the above code
rolls back part of a transaction.

So let me first explain the background for the reconnect-handling code, and
then why I think the skipping you pointed out is ok. And then if you still
think there is an issue, please be persistent - this is a really tricky part
of the replication code that is very hard to get right, so there could
definitely still be bugs.

The issue arises when the slave IO thread reconnects to the master. This may
be because of an intermittent network issue or a master shutdown, or it could
be an explicit STOP SLAVE IO_THREAD. So the SQL thread is running while the
reconnect happens.

If the reconnect happens in the middle of an event group, we have half a group
written to the relay log and potentially already partially executed by the SQL
thread. So we need to write the second half of the event group to the
relay log, no more and no less. However, since we connect at a GTID position,
we will receive from the master all the events from the start of the event
group.

So what we do is that we remember how many events from the event group we
already wrote to the relay log before the disconnect. And after reconnect we
then skip that number of events before writing the rest of the event group to
the relay log.

[Note that for reconnect after full STOP SLAVE, there is no issue, as in that
case we clear all relay logs and start over from a well-defined GTID position.
BTW, this is Bug#69758 in the MySQL 5.6 GTID implementation. Basically, the
developers totally ignored the slave reconnect issue ... :-/]

So suppose now we have a two-statement transaction:

    BEGIN; UPDATE t1 SET a=a+1; UPDATE t2 SET b=b-1; COMMIT

and that slave IO thread reconnects in the middle (between the two updates).

First we receive the BEGIN and the first UPDATE and write it to relay log 1:

    Format_Description (slave)
    Format_Description (master)
    GTID BEGIN 0-1-1
    Query UPDATE t1 SET a=a+1

Then we disconnect and reconnect and write the remaining events to
relay log 2:

    Format_Description (slave)
    # master format description received but skipped
    # GTID 0-1-1 skipped
    # Query UPDATE t1 skipped
    UPDATE t2 SET b=b-1

and the SQL thread sees an intact transaction. (I am not sure I got 100% right
exactly which non-transaction events are written, but this is the overall
idea.)

And so, the reason it is ok to skip the format description event from the
master in relay log 2 is that this _only_ happens when we know for sure that
the same master format description event was already written to an earlier
relay log and seen by the slave. Or put another way, if we had not reconnected
to the master at this point, we would not have received the master's format
description event at this point in the first place, but things would still
have worked without it obviously.

Makes sense? Or did I miss some case that could make this cause problems?

> When IO thread is reconnecting it rotates relay log and as I said it
> writes format description event at the beginning of the new file. But
> it writes an event that it created itself, i.e. not the one that
> master have sent. And as format description event from master is not
> written into relay log SQL thread from this point on starts to use
> format description generated by slave which may be different from the
> one generated by master. It may lead to a broken replication and SQL
> thread errors.

But isn't this the same problem in normal replication? Whenever the slave
decides to rotate the relay log, it will write a format description event
created by itself with no following format description created on the
master. So it seems this must work somehow, though I'll frankly admit I do not
understand the details of how it works (do you know?).

> Another somewhat related question: Gtid_log_event::peek() (as well as
> Gtid_log_event constructor from const char* buf) is implemented with
> assumption that Format_description_log_event::common_header_len is
> always equal to LOG_EVENT_HEADER_LEN. While currently it's true I

Agree, it looks like a bug.

Do you have the possibility to help with this? It is a bit hard for me to test
such a fix as I do not have an easy way to generate binlogs with different
header lengths, but I think perhaps that your team has such capability?

 - Kristian.
