maria-developers team mailing list archive

Thread
Date

Re: MDEV-532: Async InnoDB commit checkpoint

To: Sergei Golubchik <serg@xxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Fri, 14 Dec 2012 14:36:21 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20121212181937.GA22463@meddwl.fritz.box> (Sergei Golubchik's message of "Wed, 12 Dec 2012 19:19:37 +0100")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

Sergei Golubchik <serg@xxxxxxxxxxxx> writes:

> Please, find the review below!

Thanks! Answers inline, for those of your comments not mentioned below I just
followed your suggestions.

 - Kristian.

>> +        entry->cookie = cookie;
>> +
>> +        mysql_mutex_lock(&pending_checkpoint_mutex);
>> +        lsn = log_get_lsn();
>> +        flush_lsn = log_get_flush_lsn();
>> +        if (lsn > flush_lsn) {
>> +                /* Put the request in queue.
>> +                When the log gets flushed past the lsn, we will remove the
>> +                entry from the queue and notify the upper layer. */
>> +                entry->lsn = lsn;
>> +                if (pending_checkpoint_list_end) {
>> +                        pending_checkpoint_list_end->next = entry;
>
> please add an assert here that
>
>     pending_checkpoint_list_end->lsn < entry->lsn

Right, but this condition can occur, and is valid. (And it is tested - I
verified that running the binlog_xa_recover test will cause this assertion to
trigger).

Binlog checkpoint is completely asynchroneous. There is no ordering enforced,
so it is possible for checkpoint N+1 to be processed before N (in practise it
is though rather unlikely to occur).

If this happens, then we will insert (N) after (N+1) in the list. And we will
notify checkpoint (N+1) before (N). This is fine, the upper layer handles
this, and checks that all prior checkpoints have been notified when it
processes a notification.

You can see this in binlog_xa_recover.result, under the test headed
"Test that with multiple binlog checkpoints, recovery starts from the last one."

master-bin.000004	#	Binlog_checkpoint	#	#	master-bin.000002
master-bin.000004	#	Binlog_checkpoint	#	#	master-bin.000004

Here DBUG_SYNC is used to make the master-bin.000004 checkpoint occur before
master-bin.000003. So what happens is that first InnoDB notifies about
master-bin.000004, the upper layer does nothing as master-bin.000003 is still
pending. Then when master-bin.000003 is notified, upper layer sees that both 3
and 4 are ready, and logs directly master-bin.000004.

I added this comment instead of the assert, hopefully it will make things a
bit clearer:

    /* There is no need to order the entries in the list
    by lsn. The upper layer can accept notifications in
    any order, and short delays in notifications do not
    significantly impact performance. */

>> +                } else {
>> +                        pending_checkpoint_list = entry;
>> +                }
>> +                pending_checkpoint_list_end = entry;
>> +                entry = NULL;
>> +        }
>> +        mysql_mutex_unlock(&pending_checkpoint_mutex);
>> +
>> +        if (entry) {
>> +                /* We are already flushed. Notify the checkpoint immediately. */
>> +                commit_checkpoint_notify_ha(entry->hton, entry->cookie);
>> +                my_free(entry);
>
> I'm a bit worried, whether you're going to do a lot of small
> allocations. Perhaps it'd be better to use an allocator of a fixed-size
> objects, with a pool. To avoid going to the heap too often. The same
> applies to the new code log.cc.

It will not be a lot of allocations. Binlog checkpoints are only made once per
binlog rotation, not per-commit or something like that. So it is ok to do a
simple malloc().

>> @@ -3017,8 +3034,113 @@ innobase_checkpoint_request(
>>          handlerton *hton,
>>          void *cookie)
>>  {

>> +        entry->hton = hton;
>
> why do you need to remember the handlerton here? It's always
> innodb_hton_ptr, isn't it?
>
> Why do you even pass hton down as an argument, to be able to reuse one
> checkpoint request function for many engines?

There is no special reason, I agree it is probably not needed.
But I added it to be consistent with commit(), prepare(), rollback(),
etc. which all take the handlerton as argument.

Let me know if you think I should remove entry->hton, and if I should remove
hton as argument for commit_checkpoint_request(). For now, I will leave
them ...

> I'd prefer if both variables would start from "binlog_thread" or both
> from "binlog_background_thread". Either way, but the same for both
> variables.

Indeed, I agree.

> Could you somehow decide whether the official name for it is a "binlog
> background thread" or simply a "binlog thread" and use it consistently
> everywhere? It's also important for the configuration or status

I decided on "binlog background thread", and changed everywhere to be
consistent with this.

>> +  thd= new THD;
>> +  thd->system_thread= SYSTEM_THREAD_BINLOG_BACKGROUND;
>> +  my_pthread_setspecific_ptr(THR_THD, thd);
>
> ^^^ the normal way of setting up a THD, would be to call
> thd->store_globals(). Why did that not work for you and you had to
> resort to the manual my_pthread_setspecific_ptr() ?

Well, the reason is that I did not know what is the normal/correct way to
create a thread. I think I wanted to have only the necessary stuff in there.
Originally I wanted to not have even a THD (as THD has tons of stuff that is
not needed in the binlog background thread), but then DEBUG_SYNC does not
work.

I guess this is the wrong approach, and it is better to have THD and all the
usual stuff in it. I changed to use thd->store_globals() as you suggested (and
then I also needed to set THD::thread_stack).

> In 10.0 you need to use THD_STAGE_INFO()
> (here and below - everywhere instead of thd_proc_info)

Ok. I'll push it as is to 10.0-base, and then change to THD_STAGE_INFO() when
I merge it to 10.0. Thanks for the notification.

>>  /*****************************************************************//**
>>  Handle a commit checkpoint request from server layer.
>>  We simply flush the redo log immediately and do the notify call.*/
>
> The comment seems to be wrong now, doesn't it?

Good catch, thanks! I changed to this:

/*****************************************************************//**
Handle a commit checkpoint request from server layer.
We put the request in a queue, so that we can notify upper layer about
checkpoint complete when we have flushed the redo log.
If we have already flushed all relevant redo log, we notify immediately.*/

 - Kristian.

References

Re: [Commits] Rev 3435: MDEV-532: Async InnoDB commit checkpoint. in http://bazaar.launchpad.net/~maria-captains/maria/10.0
From: Kristian Nielsen, 2012-09-14
Re: MDEV-532: Async InnoDB commit checkpoint
From: Sergei Golubchik, 2012-12-12