
maria-developers team mailing list archive

Re: Memory barrier problem in InnoDB/XtraDB mutex implementation causing hangs



Thanks for your answers! My comments are inline.

On Thu, Nov 06, 2014 at 06:48:18PM +0100, Kristian Nielsen wrote:
> Sergey Vojtovich <svoj@xxxxxxxxxxx> writes:
> >>   # First spin wait, hoping to get the mutex without yielding.
> >>   os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
> > Where did you see "os_compare_and_swap_ulint(&mutex->waiters, 0, 1)"?
> > FWICS waiters of InnoDB mutex are non-atomic loads/stores.
> mutex_enter_func() calls mutex_spin_wait() which does
> mutex_set_waiters(mutex, 1):
>     mutex_set_waiters(mutex_t *mutex, ulint n) {
>             if (n) {
>                     os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
>             } else {
>                     os_compare_and_swap_ulint(&mutex->waiters, 1, 0);
>             }
> So that is where I saw it. I do not understand what you mean with
> "waiters of InnoDB mutex are non-atomic loads/stores" ?
I've never seen this before. InnoDB in 5.5 and InnoDB/XtraDB in 10.0 both have:

mutex_set_waiters(
        ib_mutex_t*     mutex,  /*!< in: mutex */
        ulint           n)      /*!< in: value to set */
{
        volatile ulint* ptr;            /* declared volatile to ensure that
                                        the value is stored to memory */

        ptr = &(mutex->waiters);

        *ptr = n;               /* Here we assume that the write of a single
                                word in memory is atomic */
}

But I confirm that I do see the CAS-based version you quoted in XtraDB 5.5.

> >> os_compare_and_swap_ulint() is defined as either
> >> __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which imply a
> >> full barrier. So ENTER looks ok.
> > See above: waiters loads/stores are not atomic, so no memory barriers at all.
> Well, the code is hopelessly complicated with various #ifdef cases. I do not
> claim to have verified all cases that they have the correct barrier for ENTER.
> However, I found that in my particular build, mutex_set_waiters used the
> INNODB_RW_LOCKS_USE_ATOMICS case, which uses os_compare_and_swap_ulint().
> And that is defined in os0sync.h as either __sync_bool_compare_and_swap(),
> which is documented as having a full barrier; or atomic_cas_ulong(), which I
> think is MySQL's own atomics, which on x86 were implemented using the LOCK
> prefix last time I looked, so also a full barrier.
> Why do you think there is no full barrier? Did I miss something?
You're right. I didn't check XtraDB in 5.5; I checked InnoDB instead.

> >> This is consistent with the observations in MDEV-7026. A thread is waiting for
> >> wakeup on a mutex that is already marked free.
> > Ok, so you say this hang is now repeatable with x86 because we changed
> > ACQUIRE -> RELEASE. I assume you were actually able to reproduce it on x86.
> I have not tried reproducing on x86.
> Note that on x86 we changed it from a full barrier (XCHG instruction) to acquire
> (which does nothing, if my test program is to be believed). The documentation
> only guarantees ACQUIRE for __sync_lock_test_and_set(), true, but in reality it
> seems to have been full barrier on x86, which would explain why it worked
> before.
> I am speculating that some of the hangs that started to appear recently could
> be due to this change. At least one case, that I researched in considerable
> depth, can be explained adequately in this way. It seems likely that this is
> the explanation, but I cannot be sure.
Agree. Hangs can happen in the following situations:
- the error monitor thread is not started (that's mostly during startup)
- the error monitor thread is stuck itself (can happen at any time)

> > Note that we also added mysterious os_isync to mutex_exit_func, which is
> > supposed to be among other things ACQUIRE. But on x86 it is no-op with a
> > comment:
> > /* Performance regression was observed at some conditions for Intel
> > architecture. Disable memory barrier for Intel architecture for now. */
> I saw the added isync. I think this is also wrong. isync is about speculative
> execution, one can use it with a conditional instruction to prevent
> instructions being executed ahead of the completion of a load. But the place I
> saw, it was used incorrectly, without the conditional instruction. I did not
> dig deeper into this though, as even if used correctly it would still not be
> the full barrier that seems to be required here.
Agree. What we basically do is:
lock_word= 0; // atomic store with release barrier
if (waiters) // non-atomic load

Here we need to ensure that the store completes before the load. But:
- there is no isync on x86
- isync doesn't guarantee StoreLoad ordering (there are many articles about this)

So a full memory barrier seems to be the only option.

> > Does it explain server lockups at startup or during normal operations? Could you
> > elaborate the problem you're talking about?
> So my guess is that on x86, the full barriers are in place, and the mutex code
> works without losing wakeups. However, at some point there was some problem,
> maybe some specific compiler or architecture or whatever, and the
> sync_arr_wake_threads_if_sema_free() hack was implemented. I am guessing that
> things work correctly on x86 because the sync_arr_wake_threads_if_sema_free()
> is fundamentally broken. It is not active during startup (and maybe
> shutdown?). And it does not work if the server error monitor thread happens to
> also get stuck on a mutex. So I think missing barriers on x86 would have been
> discovered in commonly used binaries. And the
> sync_arr_wake_threads_if_sema_free() has been able to hide problems in rare
> architectures. That is my guess.
We're on the same wavelength.

> As to the concrete problems:
> 1. MDEV-7026 there is a lockup during startup. I believe this is visible
> because the server error monitor thread is not running at this point.

> 2. In a user hang I investigated in detail, a thread was waiting on
> log_sys->mutex, but no other thread was holding that mutex (consistent with
> lost wakeup). The server error monitor thread was itself stuck on a mutex,
> which would prevent sync_arr_wake_threads_if_sema_free() from hiding the
> problem.
Strange... Monty should have fixed this. The error monitor thread should call
log_get_lsn_nowait(), which basically does a trylock. Do you happen to have a
call trace?

> > with Power8. I'd vote for putting full memory barrier into mutex_exit_func()
> > and probably remove that ugly hack with sync_arr_wake_threads_if_sema_free. :(
> Yes, I agree, that is what I meant.
> I agree that it was probably broken on PowerPC also before the change. But it
> looked to me like we had full barriers on x86 before the change (even if it
> was not guaranteed by documentation).
Correct. It was dead-broken on PPC, and it had a full barrier on x86 before.

> > Though a proper solution should never issue full memory barriers for mutex
> > lock/unlock. I suggested different idea, but that's for 10.1:
> This sounds very strange. Are you sure?
> Normally, mutex lock/unlock is assumed to be a full barrier. If my reading of
> the code is correct, this is also how it was implemented on x86. It seems
> _very_ likely that part of the InnoDB code relies on such full barriers. It
> does not seem safe to me to change it.
> Posix thread operations are for example defined to imply full barrier.
> Or do you mean that there should be _one_ full barrier implied in
> mutex_enter(), and one in mutex_exit()? But there are currently more than one,
> and this should be optimised? That would make sense. Though my personal
> opinion is that the InnoDB mutex implementation has always been a piece of
> garbage, and rather than mess more with it, it should be replaced with
> something simpler and better (isn't this what the InnoDB team is already
> working on?).
Do you have a link to articles that define a full memory barrier for mutex
lock/unlock? That is probably quite outdated writing. See e.g.:

...Atomic load with memory_order_acquire or stronger is an acquire operation.
The lock() operation on a Mutex is also an acquire operation...

...Atomic store with memory_order_release or stronger is a release operation.
The unlock() operation on a Mutex is also a release operation...

Full memory barriers are way too heavy even for mutexes. All we need to do is
ensure that:
- all loads following lock are performed after lock (acquire)
- all stores preceding unlock are completed before unlock (release)

