
maria-developers team mailing list archive

Re: Memory barrier problem in InnoDB/XtraDB mutex implementation causing hangs

 

Hi Kristian,

On Thu, Nov 06, 2014 at 03:12:55PM +0100, Kristian Nielsen wrote:
> I have been debugging some test failures seen in Buildbot:
> 
>     https://mariadb.atlassian.net/browse/MDEV-7026
> 
> After some debugging, this is what I think is going on. I will refer to
> MariaDB 5.5 sources, though the problem is also in 10.0.
> 
> InnoDB/XtraDB mutexes are taken in mutex_enter_func() and unlocked in
> mutex_exit_func().
> 
> Simplifying for clarity, here are the relevant parts of the logic:
> 
> ENTER:
> 
>   # First spin wait, hoping to get the mutex without yielding.
>   os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
Where did you see "os_compare_and_swap_ulint(&mutex->waiters, 0, 1)"?
FWICS, the InnoDB mutex 'waiters' field is accessed with plain non-atomic loads/stores.

>   if (os_atomic_test_and_set_byte(&mutex->lock_word, 1))
>     sync_array_wait_event()
> 
> RELEASE:
> 
>   os_atomic_lock_release_byte(&mutex->lock_word);
>   os_rmb;
>   waiters = *(volatile *)&mutex->waiters;
>   if (waiters)
>     mutex_signal_object(mutex);
> 
> The interesting point here is how to ensure that we do not lose the wakeup of
> a waiting thread when the holding thread releases the mutex.
> 
> The idea is that the waiting thread first sets mutex->waiters=1. Then it
> checks if the mutex is free - if not, it goes to sleep.
> 
> The thread that releases the lock first marks the mutex free. Then it checks
> if there are any waiters, and if so wakes them up.
> 
> For this to work, we need to ensure that the setting of waiters=1 has
> completed before the load of lock_word can start - a full memory barrier.
> Likewise, we need the setting of lock_word to 0 to complete before the load of
> mutex->waiters - again a full memory barrier. Otherwise we can get something
> like this:
> 
>   ENTER                               RELEASE
>   Start store to mutex->waiters
>                                       Start store to lock_word
>   Load mutex->lock_word
>                                       Store to lock_word becomes visible
>                                       Load mutex->waiters
>   Store to waiters becomes visible
>   Check lock word==1
>                                       Check waiters==0, skip wakeup
>   Sleep (forever)
> 
> Or this:
> 
>   ENTER                               RELEASE
>                                       Start store to lock_word
>   Start store to mutex->waiters
>                                       Load mutex->waiters
>   Store to waiters becomes visible
>   Load mutex->lock_word
>                                       Store to lock_word becomes visible
>   Check lock word==1
>   Sleep (forever)
>                                       Check waiters==0, skip wakeup
> 
> (So both full barriers are needed, the one in ENTER to prevent the first
> example, the one in RELEASE to prevent the second).
Agree so far.
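
To spell out where the two barriers have to go, here is a minimal sketch of the
waiters/lock_word handshake using GCC __sync builtins (toy names of my own, not
the InnoDB ones; the park/wakeup calls are placeholders for the sync array):

    /* Toy sketch, not InnoDB code: lock_word 0 = free, 1 = held. */
    struct toy_mutex {
      volatile unsigned char lock_word;
      volatile unsigned long waiters;
    };

    void toy_wait_for_wakeup(struct toy_mutex *m);  /* placeholder, defined elsewhere */
    void toy_wake_waiters(struct toy_mutex *m);     /* placeholder, defined elsewhere */

    void toy_enter_slow(struct toy_mutex *m)
    {
      m->waiters = 1;              /* announce that we may go to sleep        */
      __sync_synchronize();        /* full barrier: the waiters store must be
                                      visible before we load lock_word        */
      if (__sync_lock_test_and_set(&m->lock_word, 1) == 0)
        return;                    /* got the mutex after all                 */
      toy_wait_for_wakeup(m);      /* park in the sync array                  */
    }

    void toy_exit(struct toy_mutex *m)
    {
      m->lock_word = 0;            /* mark the mutex free                     */
      __sync_synchronize();        /* full barrier: the lock_word store must be
                                      visible before we load waiters          */
      if (m->waiters)
        toy_wake_waiters(m);       /* signal the sync array                   */
    }

With both __sync_synchronize() calls in place, neither of the two interleavings
above is possible; drop either one and the corresponding lost wakeup comes back.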

> 
> It looks like for x86, the full barriers were broken in MariaDB 5.5.40 and
> 10.0.13.
> 
> os_compare_and_swap_ulint() is defined as either
> __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which imply a
> full barrier. So ENTER looks ok.
See above: the waiters loads/stores are not atomic operations, so they imply no memory barriers at all.
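
For reference, the accesses in question look roughly like this (paraphrased
from sync0sync.c, not verbatim): plain volatile loads and stores, which by
themselves guarantee nothing about ordering:

    /* Paraphrased from sync0sync.c (5.5/10.0), not verbatim; ib_mutex_t here
       just stands in for the InnoDB mutex struct with its waiters field.   */
    typedef unsigned long ulint;
    typedef struct { volatile ulint waiters; /* ... other fields ... */ } ib_mutex_t;

    static void mutex_set_waiters(ib_mutex_t *mutex, ulint n)
    {
      volatile ulint *ptr = &mutex->waiters;
      *ptr = n;     /* plain store: volatile only forces the memory access,
                       it implies no barrier whatsoever                     */
    }

    static ulint mutex_get_waiters(const ib_mutex_t *mutex)
    {
      const volatile ulint *ptr = &mutex->waiters;
      return *ptr;  /* plain load, same story                               */
    }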

> 
> os_atomic_lock_release_byte() however is now defined as __sync_lock_release().
> This is _not_ a full barrier, just a release barrier. So the RELEASE case
> seems to be incorrect, allowing the second deadlock example to occur.
> 
> Previously, RELEASE used os_atomic_test_and_set_byte(), which is
> __sync_lock_test_and_set(). This is defined in the GCC docs as an "acquire"
> barrier only, which is also wrong, obviously. But it happens to generate an
> XCHG instruction on x86, which is a full barrier. This is probably why it
> works on x86 (and probably why it did not work on PowerPC).
> 
> Here is a small test program, and the assembly from GCC 4.7 on amd64:
> 
>   void test(unsigned *p, unsigned *q) {
>     while (__sync_lock_test_and_set(q, 1)) { }
>     *p++;
>     __sync_lock_release(q);
>   }
> 
>   test:
>           movl    $1, %edx
>   .L2:
>           movl    %edx, %eax
>           xchgl   (%rsi), %eax
>           testl   %eax, %eax
>           jne     .L2
>           movl    $0, (%rsi)
>           ret
> 
> As can be seen, the __sync_lock_test_and_set() generates a full barrier, but
> the __sync_lock_release() generates no barrier instruction at all (a plain
> store is sufficient for release semantics under the rather strong memory
> ordering of x86).
> 
> So the InnoDB mutexes look completely broken from MariaDB 5.5.40 and
> 10.0.13 :-(
> 
> This is consistent with the observations in MDEV-7026. A thread is waiting for
> wakeup on a mutex that is already marked free.
Ok, so you say this hang is now repeatable on x86 because we changed
ACQUIRE -> RELEASE. I assume you were actually able to reproduce it on x86.

Note that we also added a mysterious os_isync to mutex_exit_func(), which is
supposed to be, among other things, ACQUIRE. But on x86 it is a no-op with a
comment:
/* Performance regression was observed at some conditions for Intel
architecture. Disable memory barrier for Intel architecture for now. */
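
For comparison, with GCC on amd64 (my assumption: gcc -O2; exact output may
differ by version and flags) a real full barrier shows up as MFENCE, whereas
__sync_lock_release() compiles to just a plain store:

    void release_only(unsigned *q)
    {
      __sync_lock_release(q);    /* movl $0, (%rdi)  - no fence instruction */
    }

    void full_barrier_store(unsigned *q)
    {
      *q = 0;
      __sync_synchronize();      /* movl $0, (%rdi); mfence                 */
    }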

> 
> Apparently this is not a new problem. This comment in mutex_exit_func() makes
> me want to cry:
> 
>     /* A problem: we assume that mutex_reset_lock word
>     is a memory barrier, that is when we read the waiters
>     field next, the read must be serialized in memory
>     after the reset. A speculative processor might
>     perform the read first, which could leave a waiting
>     thread hanging indefinitely.
> 
>     Our current solution call every second
>     sync_arr_wake_threads_if_sema_free()
>     to wake up possible hanging threads if
>     they are missed in mutex_signal_object. */
Hehe, you're not the only one who has cried over this comment. :)

> 
> So someone found the mutex implementation broken, but instead of fixing it,
> they implemented a horrible workaround to try and break any deadlocks that
> might occur.
> 
> (The reason that the deadlocks are not broken in MDEV-7026 seems to be
> that it occurs during startup, before starting the srv_error_monitor_thread()
> that does the sync_arr_wake_threads_if_sema_free()).
> 
> Now, what is also interesting is that this seems to explain very nicely the
> many server lockups we have seen for users that upgrade to recent MariaDB
> versions. I was doing a very detailed investigation of one such case where the
> user was able to supply unusually detailed information. I found a thread that
> was stuck waiting on a mutex, but no other threads seemed to be holding that
> mutex. This is exactly what would be seen in case of a lost wakeup due to this
> issue. Furthermore, the sync_arr_wake_threads_if_sema_free() mechanism, as
> well as the normal InnoDB "long semaphore wait", was itself blocked on that
> same mutex wait, which explains why they were not effective for the user in
> that particular case.
Does it explain server lockups at startup or during normal operation? Could you
elaborate on the problem you're talking about?

> 
> So it looks like we need to revert the incorrect patch and release new 5.5.40a
> and 10.0.15 releases ASAP, to solve these hangs for users.
> 
> And do a proper patch for PowerPC with proper full memory
> barriers. (Incidentally, the os_rmb in RELEASE seems to also be part of some
> PowerPC patch, it also looks completely wrong).
I believe it is too late to revert this patch: we have already claimed compatibility
with POWER8. I'd vote for putting a full memory barrier into mutex_exit_func()
and probably removing that ugly hack with sync_arr_wake_threads_if_sema_free(). :(
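
Roughly, what I have in mind for the exit path is something along these lines
(a sketch only, calling __sync_synchronize() directly; a real patch would of
course go through the os0sync macros):

    void mutex_exit_func(ib_mutex_t *mutex)   /* sketch, heavily simplified */
    {
      mutex_reset_lock_word(mutex);    /* lock_word = 0                       */

      __sync_synchronize();            /* full barrier: the waiters load below
                                          must not be satisfied before the
                                          lock_word store is globally visible */

      if (mutex_get_waiters(mutex) != 0)
        mutex_signal_object(mutex);    /* wake whoever sleeps in the sync array */
    }

With that in place the periodic sync_arr_wake_threads_if_sema_free() crutch
should no longer be needed for correctness.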

That said, a proper solution should never issue full memory barriers for mutex
lock/unlock. I suggested a different idea, but that's for 10.1:
https://mariadb.atlassian.net/browse/MDEV-6654

That's unfortunate, but we attempted to fix the wrong thing with the wrong patch.
I was kind of against this:
https://mariadb.atlassian.net/browse/MDEV-6515
<quot>
- mutex_get_waiters() miss acquire memory barrier. This may cause
  mutex_exit_func() read stale 'waiters' value and be the reason
  for deadlock.

  There seem to be a workaround for that: srv_error_monitor_thread()
  is supposed to wake these stale threads every second. But if that's
  the case, we don't really need release memory barrier in
  mutex_set_waiters().
</quot>

s/acquire memory barrier/full memory barrier/
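
And to expand on the s/// above: the ordering we need on both sides is
store-then-load (store waiters, then load lock_word; store lock_word, then
load waiters). Neither a release barrier on the store nor an acquire barrier
on the load forbids reordering of that pair; only a full barrier does. A
Dekker-style sketch of the same pattern:

    /* Store-buffering litmus test, sketch only.  x plays the role of
       waiters, y of lock_word; both start at 0 and each thread runs
       its function once.                                             */
    volatile int x = 0, y = 0;
    int r1, r2;

    void enter_side(void)          /* cf. mutex_enter_func() */
    {
      x = 1;                       /* waiters = 1            */
      __sync_synchronize();        /* without this ...       */
      r1 = y;                      /* ... load of lock_word  */
    }

    void exit_side(void)           /* cf. mutex_exit_func()  */
    {
      y = 1;                       /* store to lock_word     */
      __sync_synchronize();        /* ... or this ...        */
      r2 = x;                      /* ... load of waiters    */
    }

    /* Without the two full barriers, even x86 allows the outcome
       r1 == 0 && r2 == 0 (store buffering), which is exactly the
       lost-wakeup case.  Release/acquire barriers do not forbid it. */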

> 
> However, more importantly, this issue clearly shows some fundamental problems
> in our development procedures that we need to fix ASAP.
> 
> Just look at it. We have users (and customers!) that suffer from strange
> hangs. This is their production systems that are down. Support and developers
> are unable to help them due to the difficulty of debugging rare race
> conditions in production systems.
> 
> And all the time the bug was staring us in the face, easily reproducible in
> our own Buildbot! But we deliberately chose to ignore those failures. The only
> thing that matters is "when can you push", and getting more releases
> out. Never mind what kind of crap is then pushed out to users.
> 
> I am deeply ashamed of treating our users like this. And extremely unhappy. I
> even wrote about this (the importance of not ignoring Buildbot failures) just
> a few weeks ago. This was completely ignored, I got zero replies, and the
> weekly calls continued with mindless juggling of Jira numbers and buggy
> releases.
> 
> As a result, our users suffer.
> 
> We need to stop this.
> 
> Please?
> 
>  - Kristian.
> 

Regards,
Sergey

