← Back to team overview

maria-developers team mailing list archive

Re: MDEV-232: Remove one fsync() from commit phase

 

Hi, Kristian!

On Aug 17, Kristian Nielsen wrote:
> 
> Do you want to see the changes for when I implement async
> commit_checkpoint_request() in InnoDB/XtraDB, before I push it to 10.0?

Yes, but it can be a separate patch.

> >> === modified file 'storage/xtradb/handler/ha_innodb.cc'
> >> --- a/storage/xtradb/handler/ha_innodb.cc	2012-06-15 12:54:23 +0000
> >> +++ b/storage/xtradb/handler/ha_innodb.cc	2012-06-22 10:20:29 +0000
> >> @@ -485,8 +486,11 @@ static MYSQL_THDVAR_ULONG(lock_wait_time
> >>  static MYSQL_THDVAR_ULONG(flush_log_at_trx_commit, PLUGIN_VAR_OPCMDARG,
> >>    "Set to 0 (write and flush once per second),"
> >>    " 1 (write and flush at each commit)"
> >> -  " or 2 (write at commit, flush once per second).",
> >> -  NULL, NULL, 1, 0, 2, 0);
> >> +  " or 2 (write at commit, flush once per second)."
> >> +  " Can also be set to 3 to write and flush at each commit, even when this is"
> >> +  " unnecessary because of binlog crash recovery; should not normally be used"
> >> +  " due to performance degration, but provided for completeness.",
> >
> > would you like to improve the description on =1 value? For example,
> > 1 (write and flush on prepare or on commit)
> > 3 (write and flush both on prepare and on commit, usually it is redundant)
> 
> Yes, good idea. Here is my suggestion:
> 
>   "Controls the durability/speed trade-off for commits."
>   " Set to 0 (write and flush redo log to disk only once per second),"
>   " 1 (flush to disk at each commit),"
>   " 2 (write to kernel at commit but flush to disk only once per second)"

I find "write to kernel" very confusing.
May be "write to log" ?

>   " or 3 (flush to disk twice per commit, usually redundant)."

I'd like "flush at prepare and at commit, usually redundant",
because otherwise one may ask why to flush twice at commit. It's easy to
overlook the prepare step and miss the difference between "twice PER
commit" and "twice AT commit" (especially as you write "AT commit"
above, for =1).

>   " 1 and 3 guarantees that after a crash, committed transactions will"
>   " not be lost and will be consistent with the binlog and other transactional"
>   " engines. 2 can get inconsistent and lose transactions if there is a"
>   " power failure or kernel crash but not if mysqld crashes. 0 has no"
>   " guarantees in case of crash. 0 and 2 can be faster than 1 or 3.",

Good explanation!

> > What about making innobase_checkpoint_request() implementation that flushed
> > in the background? At least to serve as an example of the correct usage of your
> > API, if not for performance. And to test your code for cases when the notification
> > isn't coming directly from the checkpoint_request method.
> >
> > it looks like it's very easy to make innodb to invoke ha_commit_checkpoint_notify
> > from the background flusher thread (the one that syncs once a second).
> > you only need to store the cookie in the log_sys, and invoke
> > ha_commit_checkpoint_notify after the flush when this stored cookie isn't NULL.
> 
> Yes, I would like to do this, I will try. It might even help performance, I
> have seen several blog posts mentioning problems with stalls at binlog
> rotation, and this might help.
> 
> As you say, this should be easy. I need to store a list of cookies and their
> associated LSN (not just a single cookie), but this is infrequently executed
> so just a simple linked list with malloc/free of elements will work fine.

I thought you don't need a list. But simply

  log_sys->cookie= cookie;

in innobase_checkpoint_request() and

  if (log_sys->cookie)
  {
    ha_commit_checkpoint_notify(log_sys->cookie);
    log_sys->cookie= 0;
  }

in log_buffer_sync_in_background().

There can never be two cookies active, right?

> >> +void
> >> +ha_commit_checkpoint_notify(handlerton *hton, void *cookie)
> >
> > are there other ha_xxx methods that a storage engine calls?
> >
> > typically server calls methods, like, ha_commit_trans, and they
> > invoke storage engine methods. Not the other way around. It is
> > exactly for this reason I used names like "trans_register_ha" -
> > inverse of ha_register_trans - to emphasize that this method is
> > invoked by the engine to call into the server, not by the server to
> > call into the engine.
> 
> Ah, thanks, good point. I suggest "commit_checkpoint_notify_ha()", what do you
> think?

Looks better, thanks!

> >> @@ -5493,7 +5656,27 @@ int MYSQL_BIN_LOG::rotate(bool force_rot
> >>        if (!write_incident_already_locked(current_thd))
> >>          flush_and_sync(0);
> >>  
> >> -    *check_purge= true;
> >> +      /*
> >> +        We failed to rotate - so we have to decrement the xid_count back that
> >> +        we incremented before attempting the rotate.
> >> +      */
> >
> > why you don't use  mark_xid_done() here?
> 
> We cannot call mark_xid_done() here, as we are holding LOCK_log - and
> mark_xid_done() also takes (sometimes) LOCK_log.

Ok, so here you only decrement the counter, and mark_xid_done()
additionally might log a checkpoint event. It looks like it might sense
to extend mark_xid_done() with a second argument

-TC_LOG_BINLOG::mark_xid_done(ulong cookie)
+TC_LOG_BINLOG::mark_xid_done(ulong cookie, bool write_checkpoint)

then you'll be able to use it here and avoid code duplication.

> >> @@ -7869,6 +8196,11 @@ int TC_LOG_BINLOG::recover(LOG_INFO *lin
> >>        if (!binlog_checkpoint_found)
> >>          break;
> >>        first_round= false;
> >> +      DBUG_EXECUTE_IF("xa_recover_expect_master_bin_000004",
> >> +          if (0 != strcmp("./master-bin.000004", binlog_checkpoint_name) &&
> >> +              0 != strcmp(".\\master-bin.000004", binlog_checkpoint_name))
...
> > Something like
...
> But as you said, this is just a test case, I'm happy to change it if you
> insists (I just did not understand the reason to change it).

No, that's okay. Keep it your way.

Regards,
Sergei


Follow ups

References