← Back to team overview

maria-developers team mailing list archive

Re: [knielsen@xxxxxxxxxxxxxxx] [Commits] Rev 3459: MDEV-232: Remove one fsync() from commit phase. in http://bazaar.launchpad.net/~maria-captains/maria/5.5


Sergei Golubchik <serg@xxxxxxxxxxxx> writes:

> Hi, Kristian!
> On Jun 22, Kristian Nielsen wrote:
>> Can you review below patch for mdev-232 (At your convenience, not urgent)?
> Here's the review. Mostly questions, I have to confess :)

Thanks a lot for your review!

I've tried to answer all your questions below, and for many I have tried to
clarify the patch, as noted.

Do you want to see the changes for when I implement async
commit_checkpoint_request() in InnoDB/XtraDB, before I push it to 10.0?

>> === modified file 'sql/slave.cc'
>> --- a/sql/slave.cc	2012-06-22 09:40:40 +0000
>> +++ b/sql/slave.cc	2012-06-22 10:20:29 +0000
>> @@ -1751,13 +1751,12 @@ when it try to get the value of TIME_ZON
>>    /* Announce MariaDB slave capabilities. */
>>    DBUG_EXECUTE_IF("simulate_slave_capability_none", goto after_set_capability;);
>>    {
>> -    const char *q=
>> -      DBUG_EVALUATE_IF("simulate_slave_capability_old_53",
>> -                       "SET @mariadb_slave_capability="
>> -                       "SET @mariadb_slave_capability="
>> -                           STRINGIFY_ARG(MARIA_SLAVE_CAPABILITY_MINE));
>> -    if (mysql_real_query(mysql, q, strlen(q)))
>> +    int rc= DBUG_EVALUATE_IF("simulate_slave_capability_old_53",
>> +        mysql_real_query(mysql, STRING_WITH_LEN("SET @mariadb_slave_capability="
>> +        mysql_real_query(mysql, STRING_WITH_LEN("SET @mariadb_slave_capability="
>> +                         STRINGIFY_ARG(MARIA_SLAVE_CAPABILITY_MINE))));
>> +    if (rc)
> Why?

Monty asked me to not use strlen() which I had introduced due to a request
from you in review of an earlier worklog. And you wrote that you did not care
much one way or the other (neither do I), so I reverted the change back to the
original without strlen().

(Maybe I should have done it as a separate changeset, but I took this short
cut since MDEV 225/181/232 are kind of tied together and will likely be pushed

>> === modified file 'storage/xtradb/handler/ha_innodb.cc'
>> --- a/storage/xtradb/handler/ha_innodb.cc	2012-06-15 12:54:23 +0000
>> +++ b/storage/xtradb/handler/ha_innodb.cc	2012-06-22 10:20:29 +0000
>> @@ -485,8 +486,11 @@ static MYSQL_THDVAR_ULONG(lock_wait_time
>>  static MYSQL_THDVAR_ULONG(flush_log_at_trx_commit, PLUGIN_VAR_OPCMDARG,
>>    "Set to 0 (write and flush once per second),"
>>    " 1 (write and flush at each commit)"
>> -  " or 2 (write at commit, flush once per second).",
>> -  NULL, NULL, 1, 0, 2, 0);
>> +  " or 2 (write at commit, flush once per second)."
>> +  " Can also be set to 3 to write and flush at each commit, even when this is"
>> +  " unnecessary because of binlog crash recovery; should not normally be used"
>> +  " due to performance degration, but provided for completeness.",
> would you like to improve the description on =1 value? For example,
> 1 (write and flush on prepare or on commit)
> 3 (write and flush both on prepare and on commit, usually it is redundant)

Yes, good idea. Here is my suggestion:

  "Controls the durability/speed trade-off for commits."
  " Set to 0 (write and flush redo log to disk only once per second),"
  " 1 (flush to disk at each commit),"
  " 2 (write to kernel at commit but flush to disk only once per second)"
  " or 3 (flush to disk twice per commit, usually redundant)."
  " 1 and 3 guarantees that after a crash, committed transactions will"
  " not be lost and will be consistent with the binlog and other transactional"
  " engines. 2 can get inconsistent and lose transactions if there is a"
  " power failure or kernel crash but not if mysqld crashes. 0 has no"
  " guarantees in case of crash. 0 and 2 can be faster than 1 or 3.",

>>  /*****************************************************************//**
>> +Handle a commit checkpoint request from server layer.
>> +We simply flush the redo log immediately and do the notify call.*/
>> +static
>> +void
>> +innobase_checkpoint_request(
>> +        handlerton *hton,
>> +        void *cookie)
>> +{
>> +        log_buffer_flush_to_disk();
>> +        ha_commit_checkpoint_notify(hton, cookie);
>> +}
>> +
>> +/*****************************************************************//**
> What about making innobase_checkpoint_request() implementation that flushed
> in the background? At least to serve as an example of the correct usage of your
> API, if not for performance. And to test your code for cases when the notification
> isn't coming directly from the checkpoint_request method.
> it looks like it's very easy to make innodb to invoke ha_commit_checkpoint_notify
> from the background flusher thread (the one that syncs once a second).
> you only need to store the cookie in the log_sys, and invoke
> ha_commit_checkpoint_notify after the flush when this stored cookie isn't NULL.

Yes, I would like to do this, I will try. It might even help performance, I
have seen several blog posts mentioning problems with stalls at binlog
rotation, and this might help.

As you say, this should be easy. I need to store a list of cookies and their
associated LSN (not just a single cookie), but this is infrequently executed
so just a simple linked list with malloc/free of elements will work fine.

>> +void
>> +ha_commit_checkpoint_notify(handlerton *hton, void *cookie)
> are there other ha_xxx methods that a storage engine calls?
> typically server calls methods, like, ha_commit_trans, and they invoke storage
> engine methods. Not the other way around. It is exactly for this reason I used names
> like "trans_register_ha" - inverse of ha_register_trans - to emphasize that this
> method is invoked by the engine to call into the server, not by the server to
> call into the engine.

Ah, thanks, good point. I suggest "commit_checkpoint_notify_ha()", what do you

>> === modified file 'sql/log.h'
>> --- a/sql/log.h	2012-06-22 09:46:28 +0000
>> +++ b/sql/log.h	2012-06-22 10:20:29 +0000
>> @@ -359,9 +373,14 @@ class MYSQL_QUERY_LOG: public MYSQL_LOG
>>    Ids start from BINLOG_COOKIE_START; the value BINLOG_COOKIE_DUMMY is special
>>    meaning "no binlog" (we cannot use zero as that is reserved for error return
>>    from log_and_order).
>> +  The low-order bit is reserved for an error flag.
>>  */

It is "right", but I agree it is incomprehensible. I will improve it.

The thing is, I need to store a number of different values in the "cookie"
used with log_and_order() and unlog():

 - 0, which is the error return value from log_and_order()

 - A pair (binlog_id, error_flag)

 - A special dummy binlog_id meaning that nothing was logged to the binlog.

I will change it to this, what do you think? :

    We assign each binlog file an internal ID, used to identify them for unlog().
    The IDs start from 0 and increment for each new binlog created.

    In unlog() we need to know the ID of the binlog file that the corresponding
    transaction was written into. We also need a special value for a corner
    case where there is no corresponding binlog id (since nothing was logged),
    as well as an error flag to mark that unlog() must return failure.

    We use the following macros to pack all of this information into the single
    ulong available with log_and_order() / unlog().

    Note that we cannot use the value 0, as that is reserved as error return
    value from log_and_order().
  #define BINLOG_COOKIE_DUMMY(error_flag) \
    ( (BINLOG_COOKIE_DUMMY_ID<<1) | ((error_flag)&1) )
  #define BINLOG_COOKIE_MAKE(id, error_flag) \
    ( (((id)+BINLOG_COOKIE_BASE)<<1) | ((error_flag)&1) )
  #define BINLOG_COOKIE_GET_ERROR_FLAG(c) ((c) & 1)
  #define BINLOG_COOKIE_GET_ID(c) ( ((ulong)(c)>>1) - BINLOG_COOKIE_BASE )
    ( ((ulong)(c)>>1) == BINLOG_COOKIE_DUMMY_ID )

Then the code in log.cc can use these macros, and not have to depend on
strange details on how to pack different values into a single ulong.
I think the resulting code is much clearer, now BINLOG_COOKIE_* is only used
in log_and_order() and unlog() functions, thanks!

>>  class MYSQL_BIN_LOG: public TC_LOG, private MYSQL_LOG
>> @@ -464,15 +492,6 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
>>    uint file_id;
>>    uint open_count;                              // For replication
>>    int readers_count;
>> -  bool need_start_event;
>> -  /*
>> -    no_auto_events means we don't want any of these automatic events :
>> -    Start/Rotate/Stop. That is, in 4.x when we rotate a relay log, we don't
>> -    want a Rotate_log event to be written to the relay log. When we start a
>> -    relay log etc. So in 4.x this is 1 for relay logs, 0 for binlogs.
>> -    In 5.0 it's 0 for relay logs too!
>> -  */
>> -  bool no_auto_events;
> Why? Because it's some old code which isn't used nowadays?

Yes. This was dead code.

>> === modified file 'sql/log.cc'
>> --- a/sql/log.cc	2012-06-22 09:46:28 +0000
>> +++ b/sql/log.cc	2012-06-22 10:20:29 +0000
>> @@ -3597,12 +3623,40 @@ bool MYSQL_BIN_LOG::reset_logs(THD* thd)

>> +    /*
>>        We are going to nuke all binary log files.
>> -      So first wait until all pending binlog checkpoints have completed.
>> +      Without binlog, we cannot XA recover prepared-but-not-committed
>> +      transactions in engines. So force a commit checkpoint first.
>> +
>> +      Note that we take and immediately release LOCK_commit_ordered. This has
>> +      the effect to ensure that any on-going group commit (in
>> +      trx_group_commit_leader()) has completed before we request the checkpoint,
>> +      due to the chaining of LOCK_log and LOCK_commit_ordered in that function.
>> +      (We are holding LOCK_log, so no new group commit can start).
>> +
>> +      Without this, it is possible (though perhaps unlikely) that the RESET
>> +      MASTER could run in-between the write to the binlog and the
>> +      commit_ordered() in the engine of some transaction, and then a crash
>> +      later would leave such transaction not recoverable.
>>      */
>> +    mysql_mutex_lock(&LOCK_commit_ordered);
>> +    mysql_mutex_unlock(&LOCK_commit_ordered);
>> +
>> +    mark_xids_active(current_binlog_id, 1);
>> +    do_checkpoint_request(current_binlog_id);
> why do_checkpoint_request ?

There may be transactions in the current binlog file that are not yet flushed
to disk in the engine. If we proceed in RESET MASTER to delete the binlog
file, and then crash, then we could lose such transactions.

So we request a checkpoint for the current binlog file (we already did so for
any previous binlog files). And then wait for this and any other pending
checkpoint to complete. Only then can we safely delete binlog files.

>> @@ -3731,6 +3780,31 @@ bool MYSQL_BIN_LOG::reset_logs(THD* thd)

>> +    /*
>> +      Remove all entries in the xid_count list except the last.
>> +      Normally we will just be deleting all the entries that we waited for to
>> +      drop to zero above. But if we fail during RESET MASTER for some reason
>> +      then we will not have created any new log file, and we may keep the last
>> +      of the old entries.
> I'm not sure I understand this comment.
> What's the point of keeping the last of the old entries, if old binlog
> files are deleted and binary logging is probably stopped (when RESET MASTER fails
> it's kinda serious)?

The code in sql/log.cc assumes that binlog_xid_count_list is never empty. So
even in the error case we must be sure to leave at least one entry in there.

So the comment is to explain why we do not simply delete everything in the
list at the start, and then insert the new entry on top of that. If we get an
error half-way, error handling is just easier by doing the deletion at the
end, where we know what we can delete and what not.

It is true that RESET MASTER failure is quite serious, and if we cannot open a
new binlog after deleting the old ones we are indeed in trouble. And there are
very likely legacy bugs lurking in this error case. But with this code, at
least no new error handling issues are introduced, and the rest of the code is
clean and can rely on having at least one entry in the list.

> If some engine confirms that everything below a certain cookie of
> a particular binlog_id is flushed. Does it automatically mean that
> all previous binlog_id's are flushed too in this engine?


This is the relevant comment in handler.h:

     The TC will periodically (eg. once per binlog rotation) call
     commit_checkpoint_request(). When this happens, the engine must arrange
     for all transaction that have completed commit_ordered() to be durably
     flushed to disk (this does not include transactions that might be in the
     middle of executing commit_ordered()). When such flush has completed, the
     engine must call commit_checkpoint_notify_ha(), passing back the opaque

commit_ordered() calls complete in the same order as transactions are written
in the binlog. And we request a checkpoint for a binlog file only when all
transactions in it have completed commit_ordered().

So when an engine has flushed all transactions that completed
commit_ordered(), this includes everything in the given binlog, as well as
anything in previous binlogs.

>> @@ -5493,7 +5656,27 @@ int MYSQL_BIN_LOG::rotate(bool force_rot
>>        if (!write_incident_already_locked(current_thd))
>>          flush_and_sync(0);
>> -    *check_purge= true;
>> +      /*
>> +        We failed to rotate - so we have to decrement the xid_count back that
>> +        we incremented before attempting the rotate.
>> +      */
> why you don't use  mark_xid_done() here?

We cannot call mark_xid_done() here, as we are holding LOCK_log - and
mark_xid_done() also takes (sometimes) LOCK_log.

>> @@ -5920,22 +6115,29 @@ MYSQL_BIN_LOG::write_binlog_checkpoint_e
>>    */
>>    if (!ev.write(&log_file) && !flush_and_sync(0))
>>    {
>> -    bool check_purge= false;
>>      signal_update();
>> -    rotate(false, &check_purge);
>> -    if (check_purge)
>> -      purge();
> Why did you remove rotate/purge from here?

Binlog rotation triggers the writing of checkpoint event in the first
place. So it does not seem right that writing checkpoint event can trigger
binlog rotation. This could in principle lead to infinite rotation of new
binlogs, though one would need a _very_ small max binlog size to trigger that.

>> +int TC_LOG_MMAP::unlog(ulong cookie, my_xid xid)
>> +{
>> +  pending_cookies *full_buffer= NULL;
>> +  DBUG_ASSERT(*(my_xid *)(data+cookie) == xid);
>> +
>> +  /*
>> +    Do not delete the entry immediately, as there may be participating storage
>> +    engines which implement commit_checkpoint_request(), and thus have not yet
>> +    flushed the commit durably to disk.
>> +
>> +    Instead put it in a queue - and periodically, we will request a checkpoint
>> +    from all engines and delete a whole batch at once.
>> +  */
>> +  mysql_mutex_lock(&LOCK_pending_checkpoint);
>> +  if (pending_checkpoint == NULL)
>> +  {
>> +    uint32 size= sizeof(*pending_checkpoint);
>> +    if (!(pending_checkpoint=
>> +          (pending_cookies *)my_malloc(size, MYF(MY_ZEROFILL))))
>> +    {
>> +      my_error(ER_OUTOFMEMORY, MYF(0), size);
>> +      mysql_mutex_unlock(&LOCK_pending_checkpoint);
>> +      return 1;
>> +    }
>> +    pending_checkpoint->cookies[pending_checkpoint->count++]= cookie;
> Hm, should this be under if (pending_checkpoint == NULL) ?

No :-(

That seems completely, utterly broken :-(

So this means that I must have not tested this code at all. Which makes me
sad. Of course, with PBXT dead, I am not sure how it is even possible to test
it... ?

Thanks for spotting.

>> +    /*
>> +      We do an extra increment and notify here - this ensures that
>> +      things work also if there are no engines at all that support
>> +      commit_checkpoint_request.
>> +    */
>> +    ++full_buffer->pending_count;
>> +    ha_commit_checkpoint_request(full_buffer, mmap_do_checkpoint_callback);
>> +    commit_checkpoint_notify(full_buffer);
> I'm sorry, I don't understand how you handle the case when some engines don't
> support commit_checkpoint_request

If an engine supports commit_checkpoint_request, then
ha_commit_checkpoint_request() will ask the engine to do a checkpoint; and
pending_count will be incremented for each such engine to ensure that we wait
for each engine to finish the checkpoint before we delete any entries from the
on-disk log.

If an engine does not support commit_checkpoint_request, then we will not do
anything with that engine before deleting entries. But such engines are
required to make their transactions durable already in their commit() method,
so deletion is always safe.

If _no_ engines support ha_commit_checkpoint_request then we would not get any
callback to commit_checkpoint_notify() - so we add one extra such a call
ourselves, at the end.

Also, the increment of pending_count and decrement in the extra
commit_checkpoint_notify() call protects against this race:

 - We call engine A's commit_checkpoint_request()

 - Before we have time to call engine B's commit_checkpoint_request(), engine
   A does commit_checkpoint_notify(), so pending_count drops to zero and we
   delete entries even though B has not had a chance to do a checkpoint yet

 - We call engine B's commit_checkpoint_request with a void *cookie pointing
   to freed memory.

Hope this clarifies things.

>> @@ -7869,6 +8196,11 @@ int TC_LOG_BINLOG::recover(LOG_INFO *lin
>>        if (!binlog_checkpoint_found)
>>          break;
>>        first_round= false;
>> +      DBUG_EXECUTE_IF("xa_recover_expect_master_bin_000004",
>> +          if (0 != strcmp("./master-bin.000004", binlog_checkpoint_name) &&
>> +              0 != strcmp(".\\master-bin.000004", binlog_checkpoint_name))
> Are you sure the name is always lowercase even on case-insensitive
> filesystems? I'd use strcasecmp to avoid the need to think about it :)
> Something like
>    if (0 != strcasecmp("master-bin.000004", binlog_checkpoint_name+2))
> (right, unsafe if strlen(binlog_checkpoint_name) < 2, but it's just a test case)

I don't understand - this has nothing to do with file system case sensitivity.

The string is generated inside log.cc based on what mysql-test-run.pl passes
to mysqld for the --log-bin option, I don't see how this could change

And I thought it was clearer to explicitly show the two possible cases for the
full file name (otherwise the offset 2 in the comparison is completely magic).

But as you said, this is just a test case, I'm happy to change it if you
insists (I just did not understand the reason to change it).

BTW, as I remember this was for something that was quite hard to test in
mysql-test-run. Since recovery happens early in the server startup, before
anything mysqltest can do. So I could not think of anything better than to
inject an error insert with a configuration option.

 - Kristian.

Follow ups