← Back to team overview

maria-developers team mailing list archive

Re: Architecture review of MWL#116 "Efficient group commit for binary log"


Hi, Kristian!

On Sep 03, Kristian Nielsen wrote:
> >>  - If we have consistent commit order, we can think about
> >>  optimising commit to need only one fsync (for binlog); lost
> >>  commits in storage engines can then be recovered from the binlog
> >>  at crash recovery by re-playing against the engine from a
> >>  particular point in the binlog.
> >
> > Why do you need consistent commit order for that?
> Hm. The case I have in mind is this:
> Now what will we do to recover? I guess we will have to replay A, B,
> and D but not C and E. So we need to get a _list_ of missing
> transactions from the engine somehow (and so also need something
> similar to unlog() against the

Yes, that's how XA recovery works, so I thought we can, kind of, reuse

> engine). We also have to ensure that replaying them in different order
> will work (I think it will, but not 100% sure).

It should. Two transactions can have different commit order in binlog
and the engine *only* if they got to ha_commit() almost at the same
time. By no means can one of these transactions depend on the other (as
locks or versioning would force the first transaction to be committed
first). So, as these transactions are completely independent, they can
be committed in any order.

> >>  - With consistent commit order, we can get better semantics for
> >>  transactions (and we could even get it to return also a matching
> >>  binlog position).  Currently, this "CONSISTENT SNAPSHOT" can be
> >>  inconsistent among multiple storage engines.
> >
> > Okay, although it's a corner case.
> >
> > It would be more important to have consistent reads working across
> > multiple storage engines without explicit START TRANSACTION WITH
> > CONSISTENT SNAPSHOT. But it's doable too.
> Do you mean per statement, or per transaction? For per statement, I
> agree. For transaction I do not see how, without having to in effect
> start a snapshot across all engines for every transaction?

oops, sorry. I thought I had a solution, but later I found a flaw in it
> >>  - In InnoDB, the performance in the presense of hotspots can be
> >>  improved if we can release row locks early in the commit phase,
> >>  but this requires that we release them in the same order as
> >>  commits in the binlog to ensure consistency between master and
> >>  slaves.
> >
> > Conceptually InnoDB can release the locks only after the commit was
> > made persistent.
> Well, that was my immediate thought as well. But the longer I have
> tried coming up with a counter-example, the less sure I am.
> Note that the Facebook patch implements this early release of locks.
> The thing is, this is not about releasing locks on modified rows. In
> InnoDB, modified rows are locked in effect by having a higher LSN in
> the row, so there is no explicit lock in memory like for gap locks in
> SERIALISABLE mode etc. So modified rows are still only unlocked when
> the transaction commits.

Ah, right. I meant locks for modified rows, indeed.
But now I understand what you're after, let me think about it...

It should be safe, probably even in SBR mode.
Because if one transaction waits on the lock set by another transaction,
this another will *always* commit first. See above - only truly
independent transactions may come to ha_commit() at the same time and
have their commit order reversed.
> >>  - There was some discussions around Galera [1] synchroneous
> >>  replication and global transaction ID that it needed consistent
> >>  commit order among participating engines.  - I believe there could
> >>  be other applications for guaranteed consistent commit order, and
> >>  that the architecture described in this worklog can implement such
> >>  guarantee with reasonable overhead.
> >
> > Okay, this much I agree with - there could be applications (galera,
> > may be) that need this consistent commit order. But because it's not
> > needed in many (if not most) cases, it'd be good to be able to
> > switch it off, if it'll help to gain performance.
> Yes.
> The other side of this is if there is really any performance gain in
> switching it off. My intension with the design was that there should
> not be any performance penalty from it.

Exactly. That's why I said "if it'll help" :)
> > ================  High-Level Specification ===================
> > ...
> >> As a consequence of the group_log_xid() optimisation, handlers must
> >> be aware that the commit_ordered() call can happen in another
> >> thread than the one running commit() (so thread local storage is
> >> not available).  This should not be a big issue as the THD is
> >> available for storing any needed information.
> >
> > I do not see how this is a consequence of the optimization.
> What I meant was, the optimisation is to run this in a single thread:
>     group_log_xid(<list of transactions>) for trx in (<list of
>     transactions>) trx->commit_ordered()
> so as a consequence, commit_ordered() for one transaction can be
> called from the thread context of another transaction.

I mean that even for group_log_xid() you can create a scheme with
conditions and mutexes that "bounces execution" to every transaction
thread to do the commit in there. So it's not really a "consequence of
the group_log_xid()."

Personally, I don't know how bad is it, as a limitation.
I think it should be pretty easy for the engine to use THD and store the
data there. As for my_error and similar things - well, we can either
defer them or use pthread_setspecific to set current_thd and
mysys_vars appropriately. Or let my_error report its error and after
commit move the error from the current_thd to the appropriate THD.
Or, really, we'll have error_reporting service eventually (hopefully
soon) and it'll solve the my_error problem.

Anyway, if the engine uses pthread_getspecific or gcc thread local
variables (__thread storage class - which is implemented not with
pthread_getspecific but with some ELF magic that I didn't bother to read
about) it will not work when we migrate the transaction from one thread
to another. But, again, I don't know how much we should care about this

> >> 1.1 Atomic queue of committing transactions
> >> 
> >> To keep the right commit order among participants, we put
> >> transactions into a queue. The operations on the queue are
> >> non-locking:
> >> 
> >>  - Insert THD at the head of the queue, and return old queue.
> >> 
> >>     THD *enqueue_atomic(THD *thd)
> >> 
> >>  - Fetch (and delete) the whole queue.
> >> 
> >>     THD *atomic_grab_reverse_queue()
> >> 
> >> These are simple to implement with atomic compare-and-set. Note
> >> that there is no ABA problem [2], as we do not delete individual
> >> elements from the queue, we grab the whole queue and replace it
> >> with NULL.
> >> 
> >> A transaction enters the queue when it does prepare_ordered(). This
> >> way, the scheduling order for prepare_ordered() calls is what
> >> determines the sequence in the queue and effectively the commit
> >> order.
> >
> > How can that work ? If you insert into a queue without a lock, you
> > cannot know what order of commits you eend up with. So, you cannot
> > have the same order as you've got fof prepare_ordered. Besides, if
> > prepare_ordered is, precisely, *ordered*, it must be done under a
> > lock.
> Right, you raise a number of issues below, and I think just answering
> each one individually will become hard to understand, so let me
> explain my thinking more in-depth.
> It would be possible to iterate over the queue to invoke
> prepare_ordered() in sequence from a single thread, just like
> group_log_xid() and commit_ordered(). But this would delay the calls
> until the previous group commit is done and the next one starts

No, why ? You only need one atomic fetch_and_store to copy the queue
head to a local variable and reset the queue. Then you can call
prepare_ordered or commit_ordered in the queue order without any mutex.

> It is correct that enqueue has to be done under the same
> LOCK_prepare_ordered to ensure the same order in queue and of
> prepare_ordered() calls. However, by making the queue lock-free, the
> _dequeue_ does not need to take LOCK_prepare_ordered, which is a
> global mutex that I wanted to reduce contention of as much as
> possible.

But you never dequeue. You reset the queue instead. It can be done under
the same mutex, at once. Resetting the queue is one assignment.
It's way faster to do that under the same mutex (and increase the mutex
hold time by the duration of one assignment) than to use lock-free
queue, which is a complicated algorithm with fallbacks, retries, threads
helping each other, and so on.
> The reason for the LOCK_commit_ordered is that there are two code
> paths that call commit_ordered(). One is the group commit leader
> thread that invokes commit_ordered() for all threads that have passed
> group_log_xid(). The other is when no 2-phase commit is done,
> ha_commit_one_phase(). It does not actually matter in which order the
> non-2-phase commit_ordered() calls happen, but I still wanted to
> provide the storage engine the guarantee that commit_ordered() would
> not be invoked by multiple threads in parallel. It just seemed cleaner
> that way to me.

Uhm, I don't know. The engine will have its own shared data structures
and locks anyway, so it'll ensure that commits cannot be executed in
parallel. I don't see why we need to take care of that.
> Another possibility would be to allow multiple calls in parallel into
> commit_ordered() (in the cases where no specific commit ordering
> between the parallel invocations is needed). Then the storage engine
> will have to be prepared for this situation, but LOCK_commit_ordered
> is no longer needed.

InnoDB will probably have its kernel lock anyway.
> > I see two possibilities of doing prepare_ordered and enqueue:
> >
> > 1.
> >
> >   pthread_mutex_lock(queue_lock);
> >   prepare_ordered(thd);
> >   enqueue(thd);
> >   pthread_mutex_unlock(queue_lock);
> >
> > in this case prepare_ordered calls are strictly serialized by a mutex.
> > There is no need for non-locking enqueue_atomic, because the same mutex
> > must be used to define the queue ordering.
> Correct, see above for why non-locking queue was used anyway.

Right, and as I said, in my opinion it does not give you anything
> > 2.
> >
> >   enqueue_atomic();
> >
> >   ... some time later ...
> >   for (thd=queue_head; thd; thd= thd->next_in_queue)
> >     prepare_ordered(thd);
> >
> > in this case insert into a queue is non-locking, the order of
> > elements in a queue is undefined. Later on, you iterate the queue
> > and call prepare_ordered in the queue order. In this implementation
> > all prepare_ordered and commit_ordered will likely be called from
> > one thread, not from the connection thread of this transaction.
> Yes, see above for why I prefer something like 1.

I don't insist on either 1 or 2.

But here, I think, calling prepare_ordered() in parallel should not be a
problem either - the engine will have its locks anyway.
> >> The queue is grabbed by the code doing group_log_xid() and
> >> commit_ordered() calls. The queue is passed directly to
> >> group_log_xid(), and afterwards iterated to do individual
> >> commit_ordered() calls.
> >> 
> >> Using a lock-free queue allows prepare_ordered() (for one
> >> transaction) to run in parallel with commit_ordered (in another
> >> transaction), increasing potential parallelism.
> >
> > Nope, in the approach 1 (with a mutex) as above, you can have this
> > too.  Simply as
> >
> >   1.
> >     ... on commit ...
> >     pthread_mutex_lock(queue_lock);
> >     commit_queue= queue;
> >     queue=NULL;
> >     pthread_mutex_unlock(queue_lock);
> >     ... now you can iterate the commit_queue in parallel ...
> >     ... with preparing other transactions ...
> >
> > actually you'll have to do it not on commit but before
> > group_log_xid().
> Yes, a simple mutex on the queue would work well. Of course, lock-free
> algorithms are all hot and cool, but if you prefer a simple
> mutex-protected queue instead I will change it.

Hot and cool, yes, I've used (and coded) them myself :)
But I only use them where I believe they actually improve concurrency,
and I think this one is not the case.

> So in the MWL#132 I managed to clean this up. Now there is nothing in
> THD, instead a small queue node is allocated on the stack in each
> thread, much nicer.
> I still have a per-thread mutex and condition in THD, since I did not
> know where else to put them without having to re-initialise on each
> transaction start. Do you have a better suggestion?

I'll write a reply to WL#132 later.
> >>         unlock(LOCK_group_commit)
> >> 
> >>         // Now we are done, so wake up all the others.
> >>         for (other IN TAIL(queue))
> >>             lock(other->LOCK_commit_ordered)
> >>             other->group_commit_ready= TRUE
> >>             cond_signal(other->COND_commit_ordered)
> >>             unlock(other->LOCK_commit_ordered)
> >
> > why do you want one condition per thread ?
> > You can use one condition per queue, such as:
> >
> >      else
> >          // If not the leader, just wait until leader did the work for us.
> >          lock(old_queue->LOCK_commit_ordered)
> >          while (!old_queue->group_commit_ready)
> >              cond_wait(old_queue->LOCK_commit_ordered, old_queue->COND_commit_ordered)
> >          unlock(old_queue->LOCK_commit_ordered)
> >
> > this way you'll need just one broadcast to wake up everybody.
> The reason is the following:
> With one condition, there will be also just one associated mutex. When
> the condition is broad-cast, every waiting thread will then contend
> for that one mutex. So one will wake up and get the mutex, then
> immediately release it, waking up the next, which will immediately
> release it, and so on.
> The end result is that the last thread will have to wait for context
> switch of every other thread before it can run. I did not like this
> unnecessary contention.
> In contrast, with one condition (and one mutex) per thread, the thread
> that does the group commit and runs all the commit_ordered() calls can
> just wake up all the others immediately, and each thread is free to
> run as soon or as late as scheduling resources permit.
> Again, I wanted to avoid forcing the kernel scheduler to bounce around
> from one thread to the next before the last thread can run.
> Do you see any advantage from using just one condition, instead of one
> per thread?

Reduce the amount of mutex locks and signals.
I don't know what's more expensive.

But I think that in many cases one can rewrite broadcast to a set of
signals and one condition per thread. It's kind of optimizing the thread
library itself - I typically prefer to use the library as designed and
leave its optimization to library developers.

But anyway, the fact is that I didn't benchmark and don't know for sure
what's better one broadcast or many signals.

> >>     else
> >>         // If not the leader, just wait until leader did the work for us.
> >>         lock(thd->LOCK_commit_ordered)
> >>         while (!thd->group_commit_ready)
> >>             cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
> >>         unlock(other->LOCK_commit_ordered)
> >
> > This doesn't work, I'm afraid. The leader can complete its job and
> > signal all threads in a queue before some of the queue threads has
> > locked their thd->LOCK_commit_ordered mutex. These threads will miss
> > the wakeup signal.
> No, it is not a problem. If the leader completes first, it will have
> set thd->group_commit_ready, and the thread will never enter
> cond_wait().  The variable thd->group_commit_ready is protected by the
> thd->LOCK_commit_ordered mutex for exactly this reason.


Still, what will happen if the context switch will stop the non-leader
thread between enqueue_atomic() and group_commit_ready=FALSE ?

Although, to fix that you don't need to chain locks, it could be
probably enough to set group_commit_ready=FALSE before enqueue.
Of course, resetting the queue in the same mutex would also help - and
it'd allow to use a simple locking queue instead of a lock-free one (as
I already said above).

Follow ups