maria-developers team mailing list archive

Thread
Date
Re: MDEV-181: XID crash recovery across binlog boundaries

To: Sergei Golubchik <serg@xxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Tue, 12 Jun 2012 14:51:22 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20120530161930.GA12250@meddwl.fritz.box> (Sergei Golubchik's message of "Wed, 30 May 2012 18:19:30 +0200")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
Sergei Golubchik <serg@xxxxxxxxxxxx> writes:

> Here's my review.

Thanks!

I've implemented most of your suggestions, below find explanations for when I
differed, and answers to questions.

 - Kristian.

>> === added file 'mysql-test/suite/binlog/t/binlog_xa_recover.test'

For MDEV-232, which I worked on during review of MDEV-181, I needed to revork
this test case extensively. So I hope it is ok that I implemented your
suggestions only in the MDEV-232 patch.

>> +# Transactions are not guaranteed stored durably on disk in the engine until
>> +# they are fsync()ed, which normally happens during commit(). But there is no
>> +# guarantee that they will _not_ be durable, in particular loosing results
>> +# of a write(2) system call normally requires a kernel crash (as opposed to
>> +# just mysqld crash), which is inconvenient to do in a test suite.
>> +# So instead we do an error insert to prevent commit_ordered() from being
>> +# called in the engine - so nothing will be written to disk at all, and crash
>> +# recovery is sure to be needed.
>
> Are these assumptions reasonable? You seem to assume that without
> commit_ordered the engine will still work normally. but simply won't
> write changes to disk. Is it always the case?

It is not always the case, it is specific to the existing code in InnoDB. This
test was rather tricky to make due to the locking - It is not possible to
rotate the binlog until commit_ordered() has run (an alternative would be a
DBUG option that skipped write() calls inside innodb to simulate write lost in
power failure).

But it does not matter so much - because for MDEV-232 this issue is gone. With
MDEV-232, commit_ordered does not write anything to disk, so the test does not
have to error insert to disable commit_ordered().

>> +SET @@global.debug_dbug='+d,skip_commit_ordered';
>
> I don't see where you restore the old value of debug_dbug

I do not need to, as the server is crashed, and the value is restored by
server restart. I added a comment.

>> +INSERT INTO t1 VALUES (0, REPEAT("x", 4100));
>
> how comes this insert is not affected by skip_commit_ordered ?

I think it is affected - but again, skip_commit_ordered is gone after
MDEV-232, so we can ignore this issue.

>> +--echo No XIDs are pending, a new binlog checkpoint should have been logged.
>
> Could you also test recovery with multiple checkpoints in a binlog?
> E.g. binlog, like the above, then crash, and see that it uses
> the *last* checkpoint in the binlog.

Done (for MDEV-232).

>> === modified file 'sql/log.h'
>> --- sql/log.h	2012-01-16 19:16:35 +0000
>> +++ sql/log.h	2012-05-14 09:51:40 +0000
>> @@ -348,6 +348,15 @@
>>    time_t last_time;
>>  };
>>  
>> +/*
>> +  We assign each binlog file an internal ID, used to identify them for unlog().
>> +  Ids start from BINLOG_COOKIE_START; the value BINLOG_COOKIE_DUMMY is special
>> +  meaning "no binlog" (we cannot use zero as that is reserved for error return
>> +  from log_and_order).
>> +*/
>> +#define BINLOG_COOKIE_DUMMY 1
>> +#define BINLOG_COOKIE_START 2
>
> Are these values completely internal and can be changed in any minor release without
> any compatibility concerns? Or they're written to binlog or something?

They are completely internal and can be changed at will.

(An alternative could be to change cookie in handler.cc to a void * and pass
around a pointer directly to the list entries, eliminating the need for this
identifier. But I think the current code is fine).

>> +  I_List<xid_count_per_binlog> binlog_xid_count_list;
>
> Why I_List?

I am not sure I understand the question?

I actually wanted to just code a list directly with `next' pointer
manipulations. But then I thought I should re-use an existing list
implementation. And I looked around and the I_List in sql_list.h was the one
that seemed best suited to what I needed.

Do you have an alternative suggestion to using an I_List?

> It'd be great, if your diff would include function names (diff -p).
> An easy way of doing it is to use my diff_p plugin, see the last section in
> http://kb.askmonty.org/en/how-to-get-more-out-of-bzr-when-working-on-mariadb
> it's the most flexible, robust and universal solution that I have, and it
> puts function names in all bzr-generated diffs, not only in "bzr diff" command.

Ok, I set it up, seems to work, thanks.

>> === modified file 'sql/log.cc'
>> --- sql/log.cc	2012-04-10 06:28:13 +0000
>> +++ sql/log.cc	2012-05-14 09:51:40 +0000

>> @@ -3506,6 +3587,42 @@
>>    mysql_mutex_lock(&LOCK_log);
>>    mysql_mutex_lock(&LOCK_index);
>>  
>> +  if (!is_relay_log)
>> +  {
>> +    /*
>> +      We are going to nuke all binary log files.
>> +      So first wait until all pending binlog checkpoints have completed.
>> +    */
>> +    mysql_mutex_lock(&LOCK_xid_list);
>> +    xid_count_per_binlog *b;
>> +    reset_master_pending= true;
>> +    for (;;)
>> +    {
>> +      I_List_iterator<xid_count_per_binlog> it(binlog_xid_count_list);
>> +      while ((b= it++))
>> +      {
>> +        if (b->xid_count > 0)
>> +          break;
>> +      }
>> +      if (!b)
>> +        break;                                  /* No more pending XIDs */
>> +      /*
>> +        Wait until signalled that one more binlog dropped to zero, then check
>> +        again.
>> +      */
>> +      mysql_cond_wait(&COND_xid_list, &LOCK_xid_list);
>> +    }
>
> I would've done it simpler, like
>
> while (b= binlog_xid_count_list.head())
> {
>    while (b->xid_count > 0)
>      mysql_cond_wait(&COND_xid_list, &LOCK_xid_list);
>    my_free(binlog_xid_count_list.get());
> }

Right.

The code is somewhat different in MDEV-232 - and I managed to simplify the
reset_master_pending logic a bit (I was a bit unhappy about the additional
complexity just for RESET MASTER). So I did not quite follow your suggestion,
but close enough.

>> +void
>> +MYSQL_BIN_LOG::write_binlog_checkpoint_event_already_locked(const char *name,
>> +                                                            uint len)
>> +{
>> +  Binlog_checkpoint_log_event ev(name, len);
>> +  /*
>> +    Note that we must sync the binlog checkpoint to disk.
>> +    Otherwise a subsequent log purge could delete binlogs that XA recovery
>> +    thinks are needed (even though they are not really).
>> +  */
>> +  if (!ev.write(&log_file) && !flush_and_sync(0))
>> +  {
>> +    bool check_purge= false;
>> +    signal_update();
>> +    rotate(false, &check_purge);
>> +    if (check_purge)
>> +      purge();
>
> why "write binlog checkpoint event" implies signal_update(), rotate(), and purge() ?

The signal_update() is needed so that slave connections will wake up and send
the new data to slaves (well, the checkpoint event is not particularly useful
for the slave, but it might be doing START SLAVE UNTIL <end of checkpoint
event>.)

The rotate (and purge) - I decided to omit those. A checkpoint event is only
generated as the result of a rotate - so does not feel right that it could
cause another rotate.

>> @@ -6061,11 +6207,11 @@
>>        {
>>          /*
>>            If we fail to rotate, which thread should get the error?
>> -          We give the error to the *last* transaction thread; that seems to
>> -          make the most sense, as it was the last to write to the log.
>> +          We give the error to the leader, as any my_error() thrown inside
>> +          rotate() will have been registered for the leader THD.
>>          */
>> -        last_in_queue->error= ER_ERROR_ON_WRITE;
>> -        last_in_queue->commit_errno= errno;
>> +        leader->error= ER_ERROR_ON_WRITE;
>> +        leader->commit_errno= errno;
>
> Where did that change happen? Is it related to MDEV-181 at all, or it's an unrelated bugfix?

Hm. I did this to make error handling correct with my changes. But now that
you mention it, there may be an error in the original code here. The error
would be that two transactions T1 (leader) and T2 group commit together, and
rotate fails. Then T2 would call my_error() and return failure to client. But
T1 would also have called my_error(), but return success to client (and cause
an assert in debug build).

IIRC, this turned up because I changed my implementation back and forth a few
times with respect to rotate - and perhaps in one of the "back" this problem
occured to me. So I did not think of it as an unrelated bugfix, but it looks
like you are right that it is.

>> @@ -6082,9 +6228,6 @@
>>    */
>>    mysql_mutex_unlock(&LOCK_log);
>>  
>> -  if (check_purge) 
>> -    purge();
>
> Why was it moved down?

Seems this is another change unrelated to MDEV-181. As I was working on the
code, I realised that this check_purge (which can be expensive) was done under
LOCK_commit_ordered (a lock that should only be held for short time), but did
not need to. So I moved it down to after LOCK_commit_ordered is released.

>> +  if (!reset_master_pending)
>> +  {
>> +    mysql_mutex_unlock(&LOCK_xid_list);
>> +    mysql_mutex_lock(&LOCK_log);
>> +    mysql_mutex_lock(&LOCK_xid_list);
>> +  }
>> +  for (;;)
>> +  {
>> +    /* Remove initial element(s) with zero count. */
>> +    b= binlog_xid_count_list.get();
>
> Eh. You need to check whether
> binlog_xid_count_list.head()->xid_count == 0 before removing it like that.
> When you released LOCK_xid_list another thread could've taken over and
> deleted all zero count elements. So now you could find that
> binlog_xid_count_list.head() has non-zero count.

Oops!

Nice catch, thanks!

>> === modified file 'sql/sql_repl.cc'
>> --- sql/sql_repl.cc	2012-04-27 08:20:38 +0000
>> +++ sql/sql_repl.cc	2012-05-14 09:51:40 +0000
>> @@ -624,6 +624,30 @@
>>    }
>>  
>>    /*
>> +    Do not send binlog checkpoint events to a slave that does not understand it.
>> +  */
>> +  if (unlikely(event_type == BINLOG_CHECKPOINT_EVENT) &&
>> +      mariadb_slave_capability < MARIA_SLAVE_CAPABILITY_BINLOG_CHECKPOINT)
>> +  {
>> +    if (mariadb_slave_capability >= MARIA_SLAVE_CAPABILITY_TOLERATE_HOLES)
>> +    {
>> +      /* This slave can tolerate events omitted from the binlog stream. */
>> +      return NULL;
>> +    }
>> +    else
>> +    {
>> +      /*
>> +        The slave does not understand BINLOG_CHECKPOINT_EVENT. Send a dummy
>> +        event instead, with same length so slave does not get confused about
>> +        binlog positions.
>> +      */
>
> Hm. Above and below the code that you've added events are simply skipped,
> without checks and dummy events. Why do you bother to replace with dummies?

Hm, there should be checks above and below also.

Note that this builds on MDEV-225. Before MDEV-225, annotate events are indeed
skipped on the master - however this does not work with older slaves - they
will get confused over binlog offsets and replication will eventually
break. But MDEV-225 fixes this by replacing with dummy events if the slave
needs it. (I'm speculating that you were looking at plain MariaDB 5.5 code
without MDEV-225, where there are no checks).

And below - we skip events marked with LOG_EVENT_SKIP_REPLICATION_F. But this
skip only happens if the slave explicitly asked for them to be skipped - so we
know that skipping is ok.

In principle these checks are not needed - officially replicating new master
to old slave is not supported. However, binlog checkpoint events will be in
_all_ master binlogs - they are not something optional that is off by
default. So I wanted to do the extra effort to preserve that replication to
old slave works nevertheless.
References

Re: MDEV-181: XID crash recovery across binlog boundaries
From: Sergei Golubchik, 2012-05-30