maria-discuss team mailing list archive

Re: Semaphore hangs

 



On 12/08/2016 04:16 PM, Jon Foster wrote:
On 12/08/2016 03:13 PM, Daniel Black wrote:
On 09/12/16 09:43, Jon Foster wrote:
On 12/07/2016 06:04 PM, Daniel Black wrote:
On 08/12/16 08:51, Jon Foster wrote:
We are having trouble with MariaDB hanging due to a "semaphore wait". We
then have to shut MariaDB down as it typically won't recover, unless it
restarts itself, which happens if we wait long enough. But if it's gone
on long enough MariaDB won't even shut down; it hangs indefinitely
waiting for some other internal service. I don't remember the exact name,
and we've been fast enough that I haven't seen it in a while.

We've had the database on two completely different servers and still see
the problem. Both servers were bought new for this project and are a
year or less old. They are running all SSD drives, Debian 7 64bit with
MariaDB 10.1 from the MariaDB APT repository.

Since the XtraDB engine was usually mentioned in the logged messages we
switched back to the Oracle InnoDB engine. Although this seems to have
reduced the frequency it didn't fix it.

Can anyone give some advice on fixing this? It really seems like a bug
in MariaDB. I'll try to provide any needed info.
[...]
So it's happened again, on Tuesday (12/13) morning, early enough that the east coasters got it before I was aware of it (they are 3 hours ahead and I was just getting up). Unfortunately I wasn't able to try the "gdb" request from the previous discussion on this topic. So I've been looking for ways to cross-reference all the threads and mutexes mentioned to try to pinpoint where the failure is happening.

This crash produced over 430MB of log data. I sliced out the first InnoDB monitor dump (a mere 1.5MB) and stripped it down to just the threads and related messages. I'm still reviewing the logs but I found something I thought was interesting enough I'd throw it out here and see if anyone had any thoughts.

There were 4,894 threads listed in the dump. But it appears that everyone was waiting for one thread. Here is what the log said about that one thread:

06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 for 0.00 seconds the semaphore:
06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, lock var 0
06:01:42 Last time reserved by thread 18446744073709551615 in file not yet reserved line 0, waiters flag 0

I trimmed out the date and server name to shorten the lines. Several interesting things to note:

1. Thread 18446744073709551615 doesn't exist in the InnoDB monitor dump.
2. All of the other thread IDs are 15 digits. This one is 20 digits.
3. Over a thousand other threads are waiting on this one because it apparently has the lock_sys->mutex mutex. All of the remaining threads are waiting on those others.
4. This thread shows a 0 second wait time when many of the other threads say they've been waiting over 250 seconds.

Sure looks like the mutex is being held by a non-existent thread. Memory corruption?
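
A quick arithmetic check on that 20-digit thread ID may be worth doing before settling on memory corruption. The sketch below only tests whether the value is the all-ones 64-bit pattern, i.e. what -1 looks like when printed as an unsigned 64-bit integer. The constant names are made up for illustration, and whether MariaDB actually uses such a value as a "no owner" placeholder is an assumption, not something confirmed from the source.

```python
# Thread IDs copied from the log excerpt above.
HOLDER_ID = 18446744073709551615   # "Last time reserved by thread ..." value
WAITER_ID = 139879467059968        # the waiting thread in the same entry

U64_MASK = (1 << 64) - 1           # largest unsigned 64-bit value, 2**64 - 1

# -1 reinterpreted as an unsigned 64-bit integer gives the all-ones pattern.
minus_one_as_u64 = -1 & U64_MASK

print(HOLDER_ID == U64_MASK)          # is the holder ID exactly 2**64 - 1?
print(HOLDER_ID == minus_one_as_u64)  # same bits as (unsigned) -1?
print(len(str(WAITER_ID)), len(str(HOLDER_ID)))  # digit counts of the two IDs
print(hex(WAITER_ID))                 # an ordinary ID, shown in hex
```

If the holder ID really is just -1 printed as unsigned, it could be an "unset" marker rather than a real thread, which would fit the "not yet reserved" and "lock var 0" text on the same log lines. That reading is a guess and would need checking against the MariaDB source.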

I'm still looking over the logs so I might find some other stuff or something else to point the finger at. But I thought I'd throw this out there and see if anyone has some insight. Or maybe I should be taking this issue to another list or report it as a bug?
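
For the cross-referencing of threads and mutexes mentioned above, here is a minimal sketch of one way to do it. It assumes every wait record has the same three-line shape as the excerpt (waiter line, "Mutex at" line, "Last time reserved by" line); other latch types in the monitor output would need their own patterns. The function and field names are hypothetical.

```python
import re
from collections import Counter

# Patterns matched against the three line shapes shown in the excerpt.
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for ([\d.]+) seconds")
MUTEX_RE = re.compile(r"Mutex at (0x[0-9a-fA-F]+) created file (\S+) line (\d+)")
HOLDER_RE = re.compile(r"Last time reserved by thread (\d+) in file (.+?) line (\d+)")

SAMPLE = """\
06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 for 0.00 seconds the semaphore:
06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, lock var 0
06:01:42 Last time reserved by thread 18446744073709551615 in file not yet reserved line 0, waiters flag 0
""".splitlines()


def parse_waits(lines):
    """Collect one dict per wait record: waiter, location, mutex, last holder."""
    waits, current = [], None
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            current = {"thread": int(m.group(1)),
                       "where": f"{m.group(2)}:{m.group(3)}",
                       "seconds": float(m.group(4)),
                       "mutex": None, "holder": None}
            waits.append(current)
            continue
        if current is None:
            continue  # skip lines outside a wait record
        m = MUTEX_RE.search(line)
        if m:
            current["mutex"] = m.group(1)
            continue
        m = HOLDER_RE.search(line)
        if m:
            current["holder"] = int(m.group(1))
            current = None  # record complete
    return waits


def summarize(waits):
    """Count waiters per mutex; list holders that never appear as a waiter."""
    waiting = {w["thread"] for w in waits}
    per_mutex = Counter(w["mutex"] for w in waits if w["mutex"])
    unseen = {w["holder"] for w in waits
              if w["holder"] is not None and w["holder"] not in waiting}
    return per_mutex, unseen


waits = parse_waits(SAMPLE)
per_mutex, unseen_holders = summarize(waits)
print(per_mutex.most_common(5))
print(sorted(unseen_holders))
```

Run over the full dump instead of SAMPLE, the per-mutex counts should show which mutex most threads are piled up behind. The "unseen" holders are only threads that are not themselves waiting in the parsed records, so they narrow down where to look rather than prove anything.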


THX - Jon

--
Sent from my Debian Linux workstation -- http://www.debian.org/intro/about

Jon Foster
JF Possibilities, Inc.
jon@xxxxxxxxxxxxxxxxxxx
541-410-2760
Making computers work for you!


