maria-developers team mailing list archive

Re: Architecture review of MWL#116 "Efficient group commit for binary log"

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxxx>
Date: Tue, 7 Sep 2010 22:51:12 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87mxruva62.fsf@knielsen-hq.org>
User-agent: Mutt/1.5.16 (2007-06-09)

Hi, Kristian!

On Sep 06, Kristian Nielsen wrote:
> However, as I revisited the algorithm, it occurred to me that it is in
> any case better to wake up threads individually as soon as
> commit_ordered() has finished. This way, the first threads in the
> queue are free to continue doing useful work while we are still
> running commit_ordered() for the last threads.
> 
> So now the algorithm is something like this:
> 
>     thd->ready= false
>     lock(LOCK_prepare_ordered)
>     old_queue= group_commit_queue
>     thd->next= old_queue
>     group_commit_queue= thd
>     ht->prepare_ordered()
>     unlock(LOCK_prepare_ordered)
> 
>     if (old_queue == NULL) // leader?
>         lock(LOCK_group_commit)
> 
>         lock(LOCK_prepare_ordered)
>         queue= reverse(group_commit_queue)
>         group_commit_queue= NULL
>         unlock(LOCK_prepare_ordered)
> 
>         group_log_xid(queue)
> 
>         lock(LOCK_commit_ordered)  // but see below
>         unlock(LOCK_group_commit)
>         for thd2 in <queue>
>             lock(thd2->LOCK_wakeup)
>             thd2->ready= true
>             signal(thd2->COND_wakeup)
>             unlock(thd2->LOCK_wakeup)
>         unlock(LOCK_commit_ordered)
>     else
>         lock(thd->LOCK_wakeup)
>         while (!thd->ready)
>             wait(thd->COND_wakeup, thd->LOCK_wakeup)
>         unlock(thd->LOCK_wakeup)
> 
>     cookie= xid_log_after()

Where in this algorithm do you call ht->commit_ordered()?
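
One reading consistent with the description ("wake up threads individually
as soon as commit_ordered() has finished") is that the call sits in the
leader's loop, just before each thread is woken. Below is a minimal Python
sketch under that assumption. The names Txn, GroupCommit and run_demo are
made up for illustration, Python locks stand in for the server mutexes, list
appends stand in for prepare_ordered() and commit_ordered(), and
group_log_xid() is only a comment.

```python
import threading


class Txn:
    """Hypothetical stand-in for a THD taking part in group commit."""

    def __init__(self, name):
        self.name = name
        self.next = None
        self.ready = False
        self.wakeup = threading.Condition()  # LOCK_wakeup + COND_wakeup


class GroupCommit:
    def __init__(self):
        self.lock_prepare_ordered = threading.Lock()
        self.lock_group_commit = threading.Lock()
        self.lock_commit_ordered = threading.Lock()
        self.queue_head = None   # group_commit_queue, newest entry first
        self.prepare_order = []  # records the stand-in prepare_ordered() calls
        self.commit_order = []   # records the stand-in commit_ordered() calls

    def commit(self, txn):
        with self.lock_prepare_ordered:
            old_head = self.queue_head
            txn.next = old_head
            self.queue_head = txn
            self.prepare_order.append(txn.name)  # ht->prepare_ordered()
        if old_head is None:
            self._lead()  # queue was empty, so this thread is the leader
        else:
            with txn.wakeup:
                while not txn.ready:
                    txn.wakeup.wait()

    def _lead(self):
        self.lock_group_commit.acquire()
        with self.lock_prepare_ordered:
            node, self.queue_head = self.queue_head, None
        queue = []
        while node is not None:
            queue.append(node)
            node = node.next
        queue.reverse()  # oldest first, i.e. prepare order
        # group_log_xid(queue) would write and sync the binlog here.
        with self.lock_commit_ordered:
            self.lock_group_commit.release()
            for t in queue:
                # Assumed placement of ht->commit_ordered(): before the wakeup.
                self.commit_order.append(t.name)
                with t.wakeup:
                    t.ready = True
                    t.wakeup.notify()


def run_demo(n):
    """Commit n transactions from n threads and return the GroupCommit."""
    gc = GroupCommit()
    threads = [threading.Thread(target=gc.commit, args=(Txn(f"t{i}"),))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gc
```

Whatever grouping the scheduler happens to produce, run_demo(8) should leave
commit_order equal to prepare_order, because the next leader cannot take
LOCK_commit_ordered until the previous leader has finished its loop.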
         
> On the other hand, the algorithm I suggested earlier for START
> TRANSACTION WITH CONSISTENT SNAPSHOT used the LOCK_commit_ordered, and
> there might be other uses...
> 
> So I am not sure. I'd like to think more about it, or what do you
> think?

START TRANSACTION WITH CONSISTENT SNAPSHOT is a good reason to keep the
mutex.
 
> >> It would be possible to iterate over the queue to invoke
> >> prepare_ordered() in sequence from a single thread, just like
> >> group_log_xid() and commit_ordered(). But this would delay the
> >> calls until the previous group commit is done and the next one
> >> starts
> >
> > No, why? You only need one atomic fetch_and_store to copy the queue
> > head to a local variable and reset the queue. Then you can call
> > prepare_ordered or commit_ordered in the queue order without any
> > mutex.
> 
> I am not sure if I understood your suggestion correctly. But what I
> considered with the above statement about "delaying calls to
> prepare_ordered()" is this:
> 
> Just like the group leader thread runs commit_ordered() in a loop over
> the queue just after group_log_xid(), we could have it do a similar
> loop for prepare_ordered() just before group_log_xid().

Yes.
 
> But I chose to do it earlier, as soon as the transaction is put in
> the queue and commit order thereby defined.
> 
> There can be quite a "long" time interval between these two events:
> the time it takes for the previous group_log_xid() (eg. an fsync()),
> plus sometimes one wants to add extra sleeps in group commit to group
> more transactions together.

No.
The long interval is *inside* the group_log_xid(), while you call
prepare_ordered() *before* it.

But anyway, the LOCK_prepare_ordered mutex is not going to be contended,
so removing it by using a lock-free queue (that's what this second
approach is about) will not bring any noticeable benefits.
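
For reference, the queue handoff in that second approach could look roughly
like the sketch below. It is only an illustration of the shape: Python has no
atomic exchange, so the hypothetical AtomicRef class emulates
compare-and-swap and fetch_and_store with an internal lock, and Node, push
and take_all are names invented here, not anything from the patch.

```python
import threading


class AtomicRef:
    """Stand-in for a hardware atomic pointer; a lock emulates atomicity."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False

    def fetch_and_store(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old


class Node:
    def __init__(self, payload):
        self.payload = payload
        self.next = None


def push(head, node):
    """Link node in front of the head; return True if the queue was empty."""
    while True:
        old = head.load()
        node.next = old
        if head.compare_and_swap(old, node):
            return old is None  # True means the caller becomes the leader


def take_all(head):
    """Detach the whole queue in one fetch_and_store; oldest payload first."""
    node = head.fetch_and_store(None)
    items = []
    while node is not None:
        items.append(node.payload)
        node = node.next
    items.reverse()
    return items
```

After take_all() the leader owns a private list, so it could walk it in
queue order and call prepare_ordered() or commit_ordered() for each entry
without holding a mutex around the queue itself.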
 
> The main performance bottleneck I am introducing is, I think, the
> serialisation of the commit_ordered() part of commit. Not just for
> some particular engine implementation, but for the interface. That is
> not a decision to be taken lightly.
> 
> Of course, compared to InnoDB today, it's much better, as it gets rid
> of the InnoDB prepare_commit_mutex (which spans all the way from end
> of prepare() to end of what is effectively commit_ordered()), and also
> avoids taking LOCK_log over all of log_xid() in the binlog.
> 
> But for something like NDB, I think serialised commit order would
> really hurt (if it even makes sense ...)
> 
> Maybe the answer here is that engines can choose to support
> commit_ordered() or not (and NDB-like engines probably will not). And
> if not, there is just no support for consistent commit order.
> 
> And if we implement the simple way to recover engines from binlog
> without fsync() in prepare() and commit(), then it will only work for
> engines supporting commit_ordered(). Later we could implement the more
> complex recovery without need for commit_ordered() support.

It's reasonable to say that if an engine does not implement
commit_ordered() then it needs to take care of its own recovery and
fsync both in prepare and commit.

Regards,
Sergei

Follow ups

Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Kristian Nielsen, 2010-09-08

References

Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Sergei Golubchik, 2010-09-02
Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Kristian Nielsen, 2010-09-03
Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Sergei Golubchik, 2010-09-03
Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Kristian Nielsen, 2010-09-06