maria-developers team mailing list archive

Thread
Date
Re: Architecture review of MWL#116 "Efficient group commit for binary log"

To: Sergei Golubchik <serg@xxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 06 Sep 2010 23:23:17 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20100903193308.GA12082@janus.mylan> (Sergei Golubchik's message of "Fri\, 3 Sep 2010 21\:33\:08 +0200")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)
Sergei Golubchik <serg@xxxxxxxxxxxx> writes:

> Hi, Kristian!

> Hot and cool, yes, I've used (and coded) them myself :)
> But I only use them where I believe they actually improve concurrency,
> and I think this one is not the case.

Very true.

So let us forget about the lock-free stuff. I'll remove the atomics and just
protect the queue with the already taken LOCK_prepare_ordered mutex. Thanks
for persisting ;-)

(And I'll try to think of a better name for LOCK_prepare_ordered).

> But I think that in many cases one can rewrite broadcast to a set of
> signals and one condition per thread. It's kind of optimizing the thread
> library itself - I typically prefer to use the library as designed and
> leave its optimization to library developers.

Yes, I agree with this general principle.

(In this case the issue is a bit different, as it is the semantics of
pthread_cond_broadcast() that we want to change, not the implementation we
want to optimise. But that is getting off-topic I guess).

> But anyway, the fact is that I didn't benchmark and don't know for sure
> what's better one broadcast or many signals.

So I could not resist the temptation to actually do the benchmark, Turns out
that pthread_cond_broadcast() is slower than individual signal, more so the
more cores are available (5 times slower were observed, check my blog if
you're interested in the details).

However, as I revisited the algorithm, it occured to me that it is in any case
better to wake up threads individually as soon as commit_ordered() has
finished. This way, the first threads in the queue are free to continue doing
useful work while we are still running commit_ordered() for the last threads.

So now the algorithm is something like this:

    thd->ready= false
    lock(LOCK_prepare_ordered)
    old_queue= group_commit_queue
    thd->next= old_queue
    group_commit_queue= thd
    ht->prepare_ordered()
    unlock(LOCK_prepare_ordered)

    if (old_queue == NULL) // leader?
        lock(LOCK_group_commit)

        lock(LOCK_prepare_ordered)
        queue= reverse(group_commit_queue)
        group_commit_queue= NULL
        unlock(LOCK_prepare_ordered)

        group_log_xid(queue)

        lock(LOCK_commit_ordered)  // but see below
        unlock(LOCK_group_commit)
        for thd2 in <queue>
            lock(thd2->LOCK_wakeup)
            thd2->ready= true
            signal(thd2->COND_wakeup)
            unlock(thd2->LOCK_wakeup)
        unlock(LOCK_commit_ordered)
    else
        lock (thd->LOCK_wakeup)
        while (!thd->ready)
            wait(COND_wakeup, LOCK_wakeup)
        unlock (thd->LOCK_wakeup)

    cookie= xid_log_after()
        
You suggest a couple more simplifications:

>> The reason for the LOCK_commit_ordered is that there are two code
>> paths that call commit_ordered(). One is the group commit leader
>> thread that invokes commit_ordered() for all threads that have passed
>> group_log_xid(). The other is when no 2-phase commit is done,
>> ha_commit_one_phase(). It does not actually matter in which order the
>> non-2-phase commit_ordered() calls happen, but I still wanted to
>> provide the storage engine the guarantee that commit_ordered() would
>> not be invoked by multiple threads in parallel. It just seemed cleaner
>> that way to me.
>
> Uhm, I don't know. The engine will have its own shared data structures
> and locks anyway, so it'll ensure that commits cannot be executed in
> parallel. I don't see why we need to take care of that.

Right. Hm, originally I thought it was cleaner if the commit_ordered() call
was protected by a mutex by the API, thinking that a call that defines commit
order should not run in parallel. But now I tend to agree with you ... if two
commit_ordered() run in parallel, it just means there is no enforced ordering
between them, and the engine will most likely have to do its own locking
anyway as you say.

So maybe we should just omit it...

On the other hand, the algorithm I suggested earlier for START TRANSACTION
WITH CONSISTENT SNAPSHOT used the LOCK_commit_ordered, and there might be
other uses...

In the above algorithm, the LOCK_commit_ordered makes the locking a bit more
fine-grained, as the next group of queued transactions can start
group_log_xid() while the previous group is running commit_ordered().

So I am not sure. I'd like to think more about it, or what do you think?

>> It would be possible to iterate over the queue to invoke
>> prepare_ordered() in sequence from a single thread, just like
>> group_log_xid() and commit_ordered(). But this would delay the calls
>> until the previous group commit is done and the next one starts
>
> No, why ? You only need one atomic fetch_and_store to copy the queue
> head to a local variable and reset the queue. Then you can call
> prepare_ordered or commit_ordered in the queue order without any mutex.

I am not sure if I understood your suggestion correctly. But what I considered
with the above statement about "delaying calls to prepare_ordered()" is this:

Just like the group leader thread runs commit_ordered() in a loop over the
queue just after group_log_xid(), we could have it do a similar loop for
prepare_ordered() just before group_log_xid().

But I choose to do it earlier, as soon as the transaction is put in the queue
and commit order thereby defined.

There can be quite a "long" time interval between these two events: the time
it takes for the previous group_log_xid() (eg. an fsync()), plus sometimes one
wants to add extra sleeps in group commit to group more transactions together.

So that is the "delay" I was refering to: the time it takes from a transaction
enters the queue, and until the prevous group commit is done and the new
leader can obtain LOCK_group_commit and run the next queue.

Since my main motivation for prepare_ordered() is to implement the InnoDB
early lock release, it seemed important to not delay the call in this
way. What do you think?

Note that it is essential to release the queue lock after enqueueing, then
wait for the LOCK_group_commit, then re-take the queue lock. Otherwise it
prevents other threads from grouping up in the queue, which is the whole
point.


> Still, what will happen if the context switch will stop the non-leader
> thread between enqueue_atomic() and group_commit_ready=FALSE ?
>
> Although, to fix that you don't need to chain locks, it could be
> probably enough to set group_commit_ready=FALSE before enqueue.

Yes, that is necessary, of course.


> I mean that even for group_log_xid() you can create a scheme with
> conditions and mutexes that "bounces execution" to every transaction
> thread to do the commit in there. So it's not really a "consequence of
> the group_log_xid()."

Yes, you are right, it is not necessary. What I meant was I think it is a
worthwhile optimisation.


> Personally, I don't know how bad is it, as a limitation.
> I think it should be pretty easy for the engine to use THD and store the
> data there. As for my_error and similar things - well, we can either
> defer them or use pthread_setspecific to set current_thd and
> mysys_vars appropriately. Or let my_error report its error and after
> commit move the error from the current_thd to the appropriate THD.
> Or, really, we'll have error_reporting service eventually (hopefully
> soon) and it'll solve the my_error problem.
>
> Anyway, if the engine uses pthread_getspecific or gcc thread local
> variables (__thread storage class - which is implemented not with
> pthread_getspecific but with some ELF magic that I didn't bother to read
> about) it will not work when we migrate the transaction from one thread
> to another. But, again, I don't know how much we should care about this
> case.

Ok, good, I also think it should be ok. I will document clearly that it is the
case, and as specified currently, prepare_ordered() commit_ordered() cannot
return an error anyway (such errors should be handled in prepare() or
commit()).

>> >>  - If we have consistent commit order, we can think about
>> >>  optimising commit to need only one fsync (for binlog); lost
>> >>  commits in storage engines can then be recovered from the binlog
>> >>  at crash recovery by re-playing against the engine from a
>> >>  particular point in the binlog.
>> >
>> > Why do you need consistent commit order for that?
>> 
>> Hm. The case I have in mind is this:
> ...
>> Now what will we do to recover? I guess we will have to replay A, B,
>> and D but not C and E. So we need to get a _list_ of missing
>> transactions from the engine somehow (and so also need something
>> similar to unlog() against the
>
> Yes, that's how XA recovery works, so I thought we can, kind of, reuse
> that.

The question in my mind is, when can the engine be allowed to forget that it
committed a transaction (eg. when would we "unlog" it)?

The best I can think of now is something like this:

1. For each transaction, at the point where that transaction becomes durable,
the engine informs TC about it.

2. When TC has been informed from engines that all transactions before a
certain point in the binlog have become durable, it sets that point as the
starting point for crash recovery.

3. When TC has stored the starting point for crash recovery durably, it can
"unlog" all transactions prior to that point to the engine.

This feels fairly complex to me. But maybe there is a simpler way?

In contrast, with consistent commit order, the engine just needs to remember
the id of the last committed transaction. It can probably just write it in its
transaction log during prepare.

Still,

>> The other side of this is if there is really any performance gain in
>> switching it off. My intension with the design was that there should
>> not be any performance penalty from it.
>
> Exactly. That's why I said "if it'll help" :)

The main performance bottleneck I am introducing is, I think, the
serialisation of the commit_ordered() part of commit. Not just for some
particular engine implementation, but for the interface. That is not a
decision to be taken lightly.

Of course, compared to InnoDB today, it's much better, as it gets rid of
the InnoDB prepare_commit_mutex (which spans all the way from end of prepare()
to end of what is effectively commit_ordered()), and also avoids taking
LOCK_log over all of log_xid() in the binlog.

But for something like NDB, I think serialised commit order would really hurt
(if it even makes sense ...)

Maybe the answer here is that engines can choose to support commit_ordered()
or not (and NDB-like engines probably will not). And if not, there is just no
support for consistent commit order.

And if we implement the simple way to recover engines from binlog without
fsync() in prepare() and commit(), then it will only work for engines
supporting commit_ordered(). Later we could implement the more complex
recovery without need for commit_ordered() support.

 - Kristian.

(PS. I haven't forgotten any of your comments, I will get to updating the
worklog descriptions eventually. I suppose I'm just a bit overwhelmed at the
moment ;-)
Follow ups

Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Sergei Golubchik, 2010-09-07
References

Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Sergei Golubchik, 2010-09-02
Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Kristian Nielsen, 2010-09-03
Re: Architecture review of MWL#116 "Efficient group commit for binary log"
From: Sergei Golubchik, 2010-09-03