maria-discuss team mailing list archive


Re: Semaphore hangs

To: maria-discuss@xxxxxxxxxxxxxxxxxxx
From: Jon Foster <jon@xxxxxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 19:50:03 -0800
In-reply-to: <b1b2cf79-29ec-c298-2109-b5fe24e66228@yeshuashalom.org>
Organization: JF Possibilities, Inc.
User-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Thunderbird/45.4.0



On 12/08/2016 04:16 PM, Jon Foster wrote:

On 12/08/2016 03:13 PM, Daniel Black wrote:

On 09/12/16 09:43, Jon Foster wrote:

On 12/07/2016 06:04 PM, Daniel Black wrote:

On 08/12/16 08:51, Jon Foster wrote:

We are having trouble with MariaDB hanging due to a "semaphore wait". We
then have to shut MariaDB down as it typically won't recover, unless it
restarts itself, which happens if we wait long enough. But if it's gone
on long enough MariaDB won't even shut down; it hangs indefinitely
waiting for some other internal service. I don't remember the exact name
and we've been fast enough I haven't seen it in a while.

We've had the database on two completely different servers and still see
the problem. Both servers were bought new for this project and are a
year or less old. They are running all SSD drives, Debian 7 64bit with
MariaDB 10.1 from the MariaDB APT repository.

Since the XtraDB engine was usually mentioned in the logged messages we
switched back to the Oracle InnoDB engine. Although this seems to have
reduced the frequency it didn't fix it.

Can anyone give some advice on fixing this? It really seems like a bug
in MariaDB. I'll try to provide any needed info.

[...]

So it's happened again on Tuesday (12/13) morning, early enough that the east-coasters got it before I was aware of it (they are 3 hrs ahead and I was just getting up). Unfortunately I wasn't able to try the "gdb" request from the previous discussion on this topic. So I've been looking for ways to cross-reference all the threads and mutexes mentioned to try and pinpoint where the failure is happening.

This crash produced over 430MB of log data. I sliced out the first InnoDB monitor dump (a mere 1.5MB) and stripped it down to just the threads and related messages. I'm still reviewing the logs but I found something I thought was interesting enough that I'd throw it out here and see if anyone had any thoughts.

There were 4,894 threads listed in the dump. But it appears that everyone was waiting for one thread. Here is what the log said about that one thread:

06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 for 0.00 seconds the semaphore:

06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, lock var 0

06:01:42 Last time reserved by thread 18446744073709551615 in file not yet reserved line 0, waiters flag 0
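
For cross-referencing waiters against lock holders, here is a minimal sketch in Python. It assumes each entry follows the three-line mutex format quoted above; the regular expressions and the function names (parse_waits, holders_not_waiting) are made up for illustration, and rw-lock entries or other InnoDB versions may be worded differently and would need their own patterns.

```python
import re

# Patterns modelled only on the three log lines quoted above (assumption:
# every mutex wait in the dump uses this wording).
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for\s*([\d.]+) seconds"
)
MUTEX_RE = re.compile(r"Mutex at (0x[0-9a-fA-F]+)")
HOLDER_RE = re.compile(r"Last time reserved by thread (\d+)")


def parse_waits(lines):
    """Return one dict per '--Thread ... has waited' entry found in lines."""
    waits = []
    current = None
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            # A new waiter starts; following lines describe its semaphore.
            current = {
                "waiter": int(m.group(1)),
                "where": f"{m.group(2)}:{m.group(3)}",
                "seconds": float(m.group(4)),
                "mutex": None,
                "holder": None,
            }
            waits.append(current)
            continue
        if current is None:
            continue
        m = MUTEX_RE.search(line)
        if m and current["mutex"] is None:
            current["mutex"] = m.group(1)
            continue
        m = HOLDER_RE.search(line)
        if m and current["holder"] is None:
            current["holder"] = int(m.group(1))
    return waits


def holders_not_waiting(waits):
    """Holder thread IDs that never show up as a waiting thread themselves."""
    waiters = {w["waiter"] for w in waits}
    holders = {w["holder"] for w in waits if w["holder"] is not None}
    return holders - waiters
```

Feeding the stripped-down dump through parse_waits and then holders_not_waiting would list the thread IDs that hold something but are not themselves waiting, which is where a chain of waiters ends. A holder missing from that list of waiters is not necessarily non-existent; it may simply be running.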

I trimmed out the date and server name to shorten the lines. Several interesting things to note:


1. Thread 18446744073709551615 doesn't exist in the InnoDB monitor dump.
2. All of the other thread IDs are 15 digits. This one is 20 digits.
3. Over a thousand other threads are waiting on this one because it apparently has the lock_sys->mutex mutex. All of the remaining threads are waiting on those others.
4. This thread shows a 0 second wait time when many of the other threads say they've been waiting over 250 seconds.

Sure looks like the mutex is being held by a non-existent thread. Memory corruption?
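
One thing worth checking about the 20-digit thread ID is the number itself. The short Python check below shows that it is exactly the largest unsigned 64-bit value, i.e. what -1 looks like when printed as an unsigned 64-bit integer. That is only arithmetic on the logged value; whether InnoDB uses that value to mean "no owner" (which would fit the "not yet reserved" file name and "lock var 0" on the same lines) is an assumption that would need to be confirmed against the source.

```python
HOLDER_ID = 18446744073709551615  # "Last time reserved by thread" value from the log
WAITER_ID = 139879467059968       # the waiting thread from the same log entry

# The holder ID has every one of its 64 bits set.
is_all_ones = HOLDER_ID == 2**64 - 1 == 0xFFFFFFFFFFFFFFFF

# Reinterpret the same 8 bytes as a signed 64-bit integer.
as_signed = int.from_bytes(HOLDER_ID.to_bytes(8, "little"), "little", signed=True)

print(is_all_ones, as_signed)                     # True -1
print(len(str(WAITER_ID)), len(str(HOLDER_ID)))   # 15 20
```

If that reading is right, the entry would describe a mutex that nobody currently holds rather than one held by a corrupted or vanished thread, which would also be consistent with the 0.00 second wait.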

I'm still looking over the logs so I might find some other stuff or something else to point the finger at. But I thought I'd throw this out there and see if anyone has some insight. Or maybe I should be taking this issue to another list or reporting it as a bug?



THX - Jon

--
Sent from my Debian Linux workstation -- http://www.debian.org/intro/about

Jon Foster
JF Possibilities, Inc.
jon@xxxxxxxxxxxxxxxxxxx
541-410-2760
Making computers work for you!

References

Semaphore hangs
From: Jon Foster, 2016-12-07
Re: Semaphore hangs
From: Daniel Black, 2016-12-08
Re: Semaphore hangs
From: Daniel Black, 2016-12-08