maria-discuss team mailing list archive
Message #04203
Re: Semaphore hangs
To: maria-discuss@xxxxxxxxxxxxxxxxxxx
From: Jon Foster <jon@xxxxxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 19:50:03 -0800
In-reply-to: <b1b2cf79-29ec-c298-2109-b5fe24e66228@yeshuashalom.org>
Organization: JF Possibilities, Inc.
User-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

On 12/08/2016 04:16 PM, Jon Foster wrote:
On 12/08/2016 03:13 PM, Daniel Black wrote:
On 09/12/16 09:43, Jon Foster wrote:
On 12/07/2016 06:04 PM, Daniel Black wrote:
On 08/12/16 08:51, Jon Foster wrote:
We are having trouble with MariaDB hanging due to a "semaphore wait". We
then have to shut MariaDB down as it typically won't recover, unless it
restarts itself, which happens if we wait long enough. But if it's gone
on long enough MariaDB won't even shut down; it hangs indefinitely
waiting for some other internal service. I don't remember the exact name
and we've been fast enough I haven't seen it in a while.
We've had the database on two completely different servers and still see
the problem. Both servers were bought new for this project and are a
year or less old. They are running all SSD drives, Debian 7 64-bit with
MariaDB 10.1 from the MariaDB APT repository.
Since the XtraDB engine was usually mentioned in the logged messages we
switched back to the Oracle InnoDB engine. Although this seems to have
reduced the frequency it didn't fix it.
Can anyone give some advice on fixing this? It really seems like a bug
in MariaDB. I'll try to provide any needed info.
[...]
So it's happened again, on Tuesday (12/13) morning, early enough that the
east coasters got it before I was aware of it (they are 3 hrs ahead and I
was just getting up). Unfortunately I wasn't able to try the "gdb" request
from the previous discussion on this topic. So I've been looking for ways to
cross-reference all the threads and mutexes mentioned to try and pinpoint
where the failure is happening.
This crash produced over 430MB of log data. I sliced out the first InnoDB
monitor dump (a mere 1.5MB) and stripped it down to just the threads and
related messages. I'm still reviewing the logs but I found something I
thought was interesting enough I'd throw it out here and see if anyone had
any thoughts.
There were 4,894 threads listed in the dump. But it appears that everyone
was waiting for one thread. Here is what the log said about that one thread:
06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 for
0.00 seconds the semaphore:
06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, lock var 0
06:01:42 Last time reserved by thread 18446744073709551615 in file not yet
reserved line 0, waiters flag 0
I trimmed out the date and server name to shorten the lines. Several
interesting things to note:
1. Thread 18446744073709551615 doesn't exist in the InnoDB monitor dump.
2. All of the other thread IDs are 15 digits. This one is 20 digits.
3. Over a thousand other threads are waiting on this one because it
apparently has the lock_sys->mutex mutex. All of the remaining threads are
waiting on those others.
4. This thread shows a 0 second wait time when many of the other threads
say they've been waiting over 250 seconds.
Sure looks like the mutex is being held by a non-existent thread. Memory
corruption?
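
One quick check on that 20-digit ID, as a sketch only: it is exactly the
largest unsigned 64-bit value, which is what -1 looks like when printed as
an unsigned 64-bit integer. That may mean it is a "no owner" placeholder
rather than a real thread ID, which would also fit the "not yet reserved"
and "lock var 0" text. That reading is an assumption, not something
confirmed from the InnoDB source. The helper name as_signed64 is made up
for this example.

```python
# The thread ID reported as the last holder of the mutex.
suspect = 18446744073709551615

# A normal-looking thread ID from the same log excerpt, for comparison.
normal = 139879467059968


def as_signed64(n):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit value."""
    return n - 2**64 if n >= 2**63 else n


# The suspect ID has all 64 bits set, i.e. it equals 2**64 - 1.
print(suspect == 2**64 - 1)
print(hex(suspect), as_signed64(suspect))
print(hex(normal), as_signed64(normal))
```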
I'm still looking over the logs so I might find some other stuff or
something else to point the finger at. But I thought I'd throw this out
there and see if anyone has some insight. Or maybe I should be taking this
issue to another list or report it as a bug?
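
For the cross-referencing, something along these lines could work. It is a
rough sketch, assuming each log record sits on one line (unwrapped) and
follows the three-line mutex format quoted above. It does not handle
rw-lock waits or any other record type. The names parse_waits and
summarize are invented for this example.

```python
import re
from collections import Counter

# Patterns modelled on the three log lines quoted in this message.
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for ([\d.]+) seconds")
MUTEX_RE = re.compile(r"Mutex at (0x[0-9a-fA-F]+) created file (\S+) line (\d+)")
HOLDER_RE = re.compile(r"Last time reserved by thread (\d+)")


def parse_waits(lines):
    """Return one dict per waiting thread found in the monitor output."""
    waits = []
    current = None
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            # A new "has waited" line starts a new record.
            current = {
                "waiter": int(m.group(1)),
                "where": "%s:%s" % (m.group(2), m.group(3)),
                "seconds": float(m.group(4)),
                "mutex": None,
                "holder": None,
            }
            waits.append(current)
            continue
        if current is None:
            continue
        m = MUTEX_RE.search(line)
        if m:
            current["mutex"] = m.group(1)
            continue
        m = HOLDER_RE.search(line)
        if m:
            current["holder"] = int(m.group(1))
    return waits


def summarize(waits):
    """Count waiters per mutex and list holders that are not waiters."""
    waiters = {w["waiter"] for w in waits}
    by_mutex = Counter(w["mutex"] for w in waits if w["mutex"])
    not_waiting = {w["holder"] for w in waits
                   if w["holder"] is not None and w["holder"] not in waiters}
    return by_mutex, not_waiting


sample = [
    "06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 "
    "for 0.00 seconds the semaphore:",
    "06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, "
    "lock var 0",
    "06:01:42 Last time reserved by thread 18446744073709551615 in file "
    "not yet reserved line 0, waiters flag 0",
]
waits = parse_waits(sample)
by_mutex, not_waiting = summarize(waits)
print(by_mutex.most_common(5))
print(sorted(not_waiting))
```

Holders that show up in the second list are only "not waiting on a
semaphore themselves"; whether they exist at all still has to be checked
against the rest of the dump.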
THX - Jon
--
Sent from my Debian Linux workstation -- http://www.debian.org/intro/about
Jon Foster
JF Possibilities, Inc.
jon@xxxxxxxxxxxxxxxxxxx
541-410-2760
Making computers work for you!