maria-developers team mailing list archive

Thread
Date
Re: MDEV-532: Async InnoDB commit checkpoint

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxxx>
Date: Wed, 12 Dec 2012 19:19:37 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <871ui4ix2a.fsf@frigg.knielsen-hq.org>
User-agent: Mutt/1.5.21 (2010-09-15)
Hi, Kristian!

Please, find the review below!
I had only few minor comments.

On Sep 14, Kristian Nielsen wrote:
> At http://bazaar.launchpad.net/~maria-captains/maria/10.0
>
> ------------------------------------------------------------
> revno: 3435
> revision-id: knielsen@xxxxxxxxxxxxxxx-20120914124453-zsap6hjclq3vrb6n
> parent: knielsen@xxxxxxxxxxxxxxx-20120913123129-kaujy4cw0jc9o08k
> committer: knielsen@xxxxxxxxxxxxxxx
> branch nick: work-10.0-mdev225-181-232
> timestamp: Fri 2012-09-14 14:44:53 +0200
> message:
>   MDEV-532: Async InnoDB commit checkpoint.
>   
>   Make the commit checkpoint inside InnoDB be asynchroneous.
>   Implement a background thread in binlog to do the writing and flushing of
>   binlog checkpoint events to disk.

> === modified file 'sql/log.cc'
> --- a/sql/log.cc	2012-09-13 12:31:29 +0000
> +++ b/sql/log.cc	2012-09-14 12:44:53 +0000
> @@ -106,6 +107,14 @@ static SHOW_VAR binlog_status_vars_detai
>    {NullS, NullS, SHOW_LONG}
>  };
>  
> +/* Variables for the binlog background thread. */

Please add to the comment: ", protected by the
LOCK_binlog_thread mutex"

> +static bool binlog_thread_started= false;
> +static bool binlog_background_thread_stop= false;

I'd prefer if both variables would start from "binlog_thread" or both
from "binlog_background_thread". Either way, but the same for both
variables.

Also, the mutex is LOCK_binlog_thread. The function is
start_binlog_background_thread(). The condition is COND_binlog_thread.
The queue is binlog_background_thread_queue.

Could you somehow decide whether the official name for it is a "binlog
background thread" or simply a "binlog thread" and use it consistently
everywhere? It's also important for the configuration or status
variables, if any, for the warnings, status or error messages, for the
documentation pages, etc.

> +static MYSQL_BIN_LOG::xid_count_per_binlog *
> +    binlog_background_thread_queue= NULL;
> +
> +static bool start_binlog_background_thread();
> +
>  
>  /**
>     purge logs, master and slave sides both, related error code
> @@ -8116,9 +8145,128 @@ int TC_LOG_BINLOG::unlog(ulong cookie, m
>  void
>  TC_LOG_BINLOG::commit_checkpoint_notify(void *cookie)
>  {
> -  mark_xid_done(((xid_count_per_binlog *)cookie)->binlog_id, true);
> +  xid_count_per_binlog *entry= static_cast<xid_count_per_binlog *>(cookie);
> +  mysql_mutex_lock(&LOCK_binlog_thread);
> +  entry->next_in_queue= binlog_background_thread_queue;
> +  binlog_background_thread_queue= entry;
> +  mysql_cond_signal(&COND_binlog_thread);
> +  mysql_mutex_unlock(&LOCK_binlog_thread);
>  }
>  
> +/*
> +  Binlog service thread.
> +
> +  This thread is used to log binlog checkpoints in the background, rather than
> +  in the context of random storage engine threads that happen to call
> +  commit_checkpoint_notify_ha() and may not like the delays while syncing
> +  binlog to disk or may not be setup with all my_thread_init() and other
> +  necessary stuff.
> +
> +  In the future, this thread could also be used to do log rotation in the
> +  background, which could elimiate all stalls around binlog rotations.
> +*/
> +pthread_handler_t
> +binlog_background_thread(void *arg __attribute__((unused)))
> +{
> +  bool stop;
> +  MYSQL_BIN_LOG::xid_count_per_binlog *queue, *next;
> +  THD *thd;
> +
> +  my_thread_init();
> +  thd= new THD;
> +  thd->system_thread= SYSTEM_THREAD_BINLOG_BACKGROUND;
> +  my_pthread_setspecific_ptr(THR_THD, thd);

^^^ the normal way of setting up a THD, would be to call
thd->store_globals(). Why did that not work for you and you had to
resort to the manual my_pthread_setspecific_ptr() ?

> +  mysql_mutex_lock(&LOCK_thread_count);
> +  thd->thread_id= thread_id++;
> +  mysql_mutex_unlock(&LOCK_thread_count);
> +
> +  for (;;)
> +  {
> +    /*
> +      Wait until there is something in the queue to process, or we are asked
> +      to shut down.
> +    */
> +    thd_proc_info(thd, "Waiting for background binlog tasks");

In 10.0 you need to use THD_STAGE_INFO()
(here and below - everywhere instead of thd_proc_info)

> +    mysql_mutex_lock(&mysql_bin_log.LOCK_binlog_thread);
> +    for (;;)
> +    {
> +      stop= binlog_background_thread_stop;
> +      queue= binlog_background_thread_queue;
> +      if (stop || queue)
> +        break;
> +      mysql_cond_wait(&mysql_bin_log.COND_binlog_thread,
> +                      &mysql_bin_log.LOCK_binlog_thread);
> +    }
> +    /* Grab the queue, if any. */
> +    binlog_background_thread_queue= NULL;
> +    mysql_mutex_unlock(&mysql_bin_log.LOCK_binlog_thread);
...
> === modified file 'storage/innobase/handler/ha_innodb.cc'
> --- a/storage/innobase/handler/ha_innodb.cc	2012-09-13 12:31:29 +0000
> +++ b/storage/innobase/handler/ha_innodb.cc	2012-09-14 12:44:53 +0000
> @@ -3008,6 +3015,16 @@ innobase_rollback_trx(
>          DBUG_RETURN(convert_error_code_to_mysql(error, 0, NULL));
>  }
>  
> +
> +struct pending_checkpoint {
> +        struct pending_checkpoint *next;
> +        handlerton *hton;
> +        void *cookie;
> +        ib_uint64_t lsn;
> +};
> +static struct pending_checkpoint *pending_checkpoint_list;
> +static struct pending_checkpoint *pending_checkpoint_list_end;
> +
>  /*****************************************************************//**
>  Handle a commit checkpoint request from server layer.
>  We simply flush the redo log immediately and do the notify call.*/

The comment seems to be wrong now, doesn't it?

> @@ -3017,8 +3034,113 @@ innobase_checkpoint_request(
>          handlerton *hton,
>          void *cookie)
>  {
> -        log_buffer_flush_to_disk();
> -        commit_checkpoint_notify_ha(hton, cookie);
> +        ib_uint64_t                     lsn;
> +        ib_uint64_t                     flush_lsn;
> +        struct pending_checkpoint *     entry;
> +
> +        /* Do the allocation outside of lock to reduce contention. The normal
> +        case is that not everything is flushed, so we will need to enqueue. */
> +        entry = static_cast<struct pending_checkpoint *>
> +                (my_malloc(sizeof(*entry), MYF(MY_WME)));
> +        if (!entry) {
> +                sql_print_error("Failed to allocate %u bytes."
> +                                " Commit checkpoint will be skipped.",
> +                                static_cast<unsigned>(sizeof(*entry)));
> +                return;
> +        }
> +
> +        entry->next = NULL;
> +        entry->hton = hton;

why do you need to remember the handlerton here? It's always
innodb_hton_ptr, isn't it?

Why do you even pass hton down as an argument, to be able to reuse one
checkpoint request function for many engines?

> +        entry->cookie = cookie;
> +
> +        mysql_mutex_lock(&pending_checkpoint_mutex);
> +        lsn = log_get_lsn();
> +        flush_lsn = log_get_flush_lsn();
> +        if (lsn > flush_lsn) {
> +                /* Put the request in queue.
> +                When the log gets flushed past the lsn, we will remove the
> +                entry from the queue and notify the upper layer. */
> +                entry->lsn = lsn;
> +                if (pending_checkpoint_list_end) {
> +                        pending_checkpoint_list_end->next = entry;

please add an assert here that

    pending_checkpoint_list_end->lsn < entry->lsn

> +                } else {
> +                        pending_checkpoint_list = entry;
> +                }
> +                pending_checkpoint_list_end = entry;
> +                entry = NULL;
> +        }
> +        mysql_mutex_unlock(&pending_checkpoint_mutex);
> +
> +        if (entry) {
> +                /* We are already flushed. Notify the checkpoint immediately. */
> +                commit_checkpoint_notify_ha(entry->hton, entry->cookie);
> +                my_free(entry);

I'm a bit worried, whether you're going to do a lot of small
allocations. Perhaps it'd be better to use an allocator of a fixed-size
objects, with a pool. To avoid going to the heap too often. The same
applies to the new code log.cc.

Lock-free allocator can do that. May be we have other fixed-size
allocators with a pool too.

> +        }
> +}

Regards,
Sergei
Follow ups

Re: MDEV-532: Async InnoDB commit checkpoint
From: Kristian Nielsen, 2012-12-14
References

Re: [Commits] Rev 3435: MDEV-532: Async InnoDB commit checkpoint. in http://bazaar.launchpad.net/~maria-captains/maria/10.0
From: Kristian Nielsen, 2012-09-14