
Memory barrier problem in InnoDB/XtraDB mutex implementation causing hangs


I have been debugging some test failures seen in Buildbot:

    https://mariadb.atlassian.net/browse/MDEV-7026

After some debugging, this is what I think is going on. I will refer to
MariaDB 5.5 sources, though the problem is also in 10.0.

InnoDB/XtraDB mutexes are taken in mutex_enter_func() and unlocked in
mutex_exit_func().

Simplifying for clarity, here are the relevant parts of the logic:

ENTER:

  # (First a spin wait, elided here, hoping to get the mutex without yielding.)
  os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
  if (os_atomic_test_and_set_byte(&mutex->lock_word, 1))
    sync_array_wait_event()

RELEASE:

  os_atomic_lock_release_byte(&mutex->lock_word);
  os_rmb;
  waiters = *(volatile ulint *)&mutex->waiters;
  if (waiters)
    mutex_signal_object(mutex);

The interesting point here is how to ensure that we do not lose the wakeup of
a waiting thread when the holding thread releases the mutex.

The idea is that the waiting thread first sets mutex->waiters=1. Then it
checks if the mutex is free - if not, it goes to sleep.

The thread that releases the lock first marks the mutex free. Then it checks
if there are any waiters, and if so wakes them up.

For this to work, we need to ensure that the setting of waiters=1 has
completed before the load of lock_word can start - a full memory barrier.
Likewise, we need the setting of lock_word to 0 to complete before the load of
mutex->waiters - again a full memory barrier. Otherwise we can get something
like this:

  ENTER                               RELEASE
  Start store to mutex->waiters
                                      Start store to lock_word
  Load mutex->lock_word
                                      Store to lock_word becomes visible
                                      Load mutex->waiters
  Store to waiters becomes visible
  Check lock word==1
                                      Check waiters==0, skip wakeup
  Sleep (forever)

Or this:

  ENTER                               RELEASE
                                      Start store to lock_word
  Start store to mutex->waiters
                                      Load mutex->waiters
  Store to waiters becomes visible
  Load mutex->lock_word
                                      Store to lock_word becomes visible
  Check lock word==1
  Sleep (forever)
                                      Check waiters==0, skip wakeup

(So both full barriers are needed, the one in ENTER to prevent the first
example, the one in RELEASE to prevent the second).

It looks like for x86, the full barriers were broken in MariaDB 5.5.40 and
10.0.13.

os_compare_and_swap_ulint() is defined as either
__sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which imply a
full barrier. So ENTER looks ok.
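(As a quick illustration - my own snippet, not the InnoDB code - the GCC
builtin compiles on amd64 to a LOCK CMPXCHG instruction, which is a full
barrier:

  int cas_demo(unsigned long *waiters) {
    /* Atomically change *waiters from 0 to 1.
       GCC documents this builtin as a full barrier. */
    return __sync_bool_compare_and_swap(waiters, 0, 1);
  }
)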

os_atomic_lock_release_byte() however is now defined as __sync_lock_release().
This is _not_ a full barrier, just a release barrier. So the RELEASE case
seems to be incorrect, allowing the second deadlock example to occur.

Previously, RELEASE used os_atomic_test_and_set_byte(), which is
__sync_lock_test_and_set(). This is defined in the GCC docs as an "acquire"
barrier only, which is also wrong, obviously. But it happens to generate an
XCHG instruction on x86, which is a full barrier. This is probably why it
works on x86 (and probably why it did not work on PowerPC).

Here is a small test program, and the assembly generated by GCC on amd64:

  void test(unsigned *p, unsigned *q) {
    while (__sync_lock_test_and_set(q, 1)) { }  /* lock: XCHG, a full barrier on x86 */
    *p++;
    __sync_lock_release(q);                     /* unlock: plain store, no barrier */
  }

  test:
          movl    $1, %edx
  .L2:
          movl    %edx, %eax
          xchgl   (%rsi), %eax
          testl   %eax, %eax
          jne     .L2
          movl    $0, (%rsi)
          ret

As can be seen, the __sync_lock_test_and_set() generates a full barrier (the
XCHG instruction), but the __sync_lock_release() generates nothing but a plain
store (as that is sufficient for release semantics under the comparatively
strong memory ordering of x86).
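(For comparison, a sketch of how the RELEASE side could get its full barrier -
again my own example, not a tested patch - using __sync_synchronize(), which
GCC compiles to an MFENCE instruction on amd64:

  extern void wake_waiters(void);    /* stand-in for mutex_signal_object() */

  void release_fixed(unsigned char *lock_word, unsigned *waiters) {
    __sync_lock_release(lock_word);  /* lock_word = 0, release semantics */
    __sync_synchronize();            /* full barrier: the store above must
                                        complete before the load below */
    if (*(volatile unsigned *)waiters)
      wake_waiters();
  }
)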

So the InnoDB mutexes look completely broken from MariaDB 5.5.40 and
10.0.13 :-(

This is consistent with the observations in MDEV-7026. A thread is waiting for
wakeup on a mutex that is already marked free.

Apparently this is not a new problem. This comment in mutex_exit_func() makes
me want to cry:

    /* A problem: we assume that mutex_reset_lock word
    is a memory barrier, that is when we read the waiters
    field next, the read must be serialized in memory
    after the reset. A speculative processor might
    perform the read first, which could leave a waiting
    thread hanging indefinitely.

    Our current solution call every second
    sync_arr_wake_threads_if_sema_free()
    to wake up possible hanging threads if
    they are missed in mutex_signal_object. */

So someone found the mutex implementation broken, but instead of fixing it,
they implemented a horrible workaround to try and break any deadlocks that
might occur.

(The reason that the deadlocks are not broken in MDEV-7026 seems to be that
the hang occurs during startup, before the srv_error_monitor_thread() that
does the sync_arr_wake_threads_if_sema_free() calls has been started.)

Now, what is also interesting is that this seems to explain very nicely the
many server lockups we have seen for users that upgrade to recent MariaDB
versions. I was doing a very detailed investigation of one such case where the
user was able to supply unusually detailed information. I found a thread that
was stuck waiting on a mutex, but no other threads seemed to be holding that
mutex. This is exactly what would be seen in case of a lost wakeup due to this
issue. Furthermore, the sync_arr_wake_threads_if_sema_free() mechanism, as
well as the normal InnoDB "long semaphore wait", was itself blocked on that
same mutex wait, which explains why they were not effective for the user in
that particular case.

So it looks like we need to revert the incorrect patch and put out new 5.5.40a
and 10.0.15 releases ASAP, to solve these hangs for users.

And then do a proper patch for PowerPC, with real full memory
barriers. (Incidentally, the os_rmb in RELEASE seems to also be part of some
PowerPC patch; it also looks completely wrong, since a read barrier only
orders loads against loads, while what is needed here is to order the store to
lock_word before the following load of waiters.)

However, more importantly, this issue clearly shows some fundamental problems
in our development procedures that we need to fix ASAP.

Just look at it. We have users (and customers!) that suffer from strange
hangs. This is their production systems that are down. Support and developers
are unable to help them due to the difficulty of debugging rare race
conditions in production systems.

And all the time the bug was staring us in the face, easily reproducible in
our own Buildbot! But we deliberately chose to ignore those failures. The only
thing that matters is "when can you push", and getting more releases
out. Never mind what kind of crap is then pushed out to users.

I am deeply ashamed of treating our users like this. And extremely unhappy. I
even wrote about this (the importance of not ignoring Buildbot failures) just
a few weeks ago. This was completely ignored, I got zero replies, and the
weekly calls continued with mindless juggling of Jira numbers and buggy
releases.

As a result, our users suffer.

We need to stop this.

Please?

 - Kristian.

