
Re: Semaphore hangs

 


On 09/12/16 09:43, Jon Foster wrote:
> On 12/07/2016 06:04 PM, Daniel Black wrote:
>> On 08/12/16 08:51, Jon Foster wrote:
>>> We are having trouble with MariaDB hanging due to a "semaphore wait". We
>>> then have to shut MariaDB down as it typically won't recover, unless it
>>> restarts itself, which happens if we wait long enough. But if it's gone
>>> on long enough MariaDB won't even shut down; it hangs indefinitely
>>> waiting for some other internal service. I don't remember the exact name,
>>> and we've been fast enough that I haven't seen it in a while.
>>>
>>> We've had the database on two completely different servers and still see
>>> the problem. Both servers were bought new for this project and are a
>>> year or less old. They are running all SSD drives, Debian 7 64bit with
>>> MariaDB 10.1 from the MariaDB APT repository.
>>>
>>> Since the XtraDB engine was usually mentioned in the logged messages we
>>> switched back to the Oracle InnoDB engine. Although this seems to have
>>> reduced the frequency it didn't fix it.
>>>
>>> Can anyone give some advice on fixing this? It really seems like a bug
>>> in MariaDB. I'll try to provide any needed info.

It's either a bug or you are running on insufficient hardware.

If you rule out the latter, I suggest opening a report at jira.mariadb.org.


>>> TIA - Jon
>>>
>> Needed info:
>> * exact MariaDB version 10.1.X
>> * mariadb configuration
>> * what is the entire mysql log from the startup until the "semaphore
>> wait"
>> * how are you determining "semaphore wait"
>> * is a SHOW FULL PROCESSLIST possible? If so, its contents?
>> * when in a stalled state, run on a command line (with gdb installed):
>> gdb --batch --eval-command="thread apply all bt" program $(pidof mysqld)
>>
> Version: 10.1.19+maria-1~wheezy, but we started having problems with
> 5.5.x, which after a few updates prompted us to try 10.1.
> 
> Configuration:
> -----------------------
> [mysqld_safe]
> socket          = /var/run/mysqld/mysqld.sock
> nice            = -10

Nicing below 0 isn't that productive. -10 might be interfering with
kernel IO processes that are needed for userspace threads to resume.

No other high-IO/CPU/RAM workload should be running on the same host;
if one is, -1 or -2 might be enough to give mysqld a higher priority
than the other process, as in the sketch below.
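
For example, something like this in the config (just a sketch, adjust
to taste) keeps mysqld slightly ahead of other userspace processes
without competing so hard with kernel IO threads:

[mysqld_safe]
nice = -2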

> [mysqld]
> user            = mysql
> pid-file        = /var/run/mysqld/mysqld.pid
> socket          = /var/run/mysqld/mysqld.sock
> port            = 3306
> basedir         = /usr
> datadir         = /ssd/mysql
> tmpdir          = /ssd/tmp
> lc_messages_dir = /usr/share/mysql
> lc_messages     = en_US
> skip-external-locking
> # NOTE: these two were recently added to troubleshoot
> ignore_builtin_innodb
> plugin_load=innodb=ha_innodb.so
> bind-address            = 127.0.0.1
> skip-name-resolve
> max_connections         = 8000

At the time of the crash you seem to have had about 4k threads:

zgrep 12:35:18 /tmp/mozilla_dan0/mysql-crash-20161207.log.gz | grep ' --Thread' | cut -f 8 -d ' ' | sort -u | wc -l
4158

I suspect you don't have enough grunt in hardware to support this.

I do suspect that this level is significantly above standard load.

I'd restrict max_connections to something you have seen your hardware
handle.
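
One way to see what the server has actually needed (rather than the
configured ceiling) is to check the standard connection counters after a
normal busy period, e.g.:

mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections'"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running'"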

> connect_timeout         = 5
> wait_timeout            = 60
> max_allowed_packet      = 1M
> thread_cache_size       = 100
> sort_buffer_size        = 1M
> bulk_insert_buffer_size = 16M
> tmp_table_size          = 32M
> max_heap_table_size     = 10M
> join_buffer_size        = 1M
> myisam_recover          = BACKUP
> key_buffer_size         = 128M
> open-files-limit        = 65535
> table_open_cache        = 15000
> myisam_sort_buffer_size = 128M
> concurrent_insert       = 2
> read_buffer_size        = 256K
> read_rnd_buffer_size    = 256K
> query_cache_limit               = 1M
> query_cache_size                = 16M
> slow_query_log=0
> slow_query_log_file     = /var/log/mysql/mariadb-slow.log
> long_query_time = 3
> log_slow_verbosity      = query_plan
> binlog_format           = ROW
> expire_logs_days        = 10
> max_binlog_size         = 100M
> default_storage_engine  = InnoDB
> innodb_buffer_pool_size = 2600M
> innodb_log_buffer_size  = 256M
> innodb_file_per_table   = 1
> innodb_open_files       = 400
> innodb_io_capacity      = 400
> innodb_adaptive_hash_index = 0
> # These two were added just prior to this last crash.
> innodb_buffer_pool_dump_at_shutdown = 1
> innodb_buffer_pool_load_at_startup = 1
> innodb_flush_log_at_trx_commit = 0

Setting ^ to 0 will mean less IO, however it also means you can lose the
most recently committed transactions when the server crashes.
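
If you want full durability back while you are chasing this, the usual
combination (at the cost of an fsync per commit) is:

innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1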

> innodb_support_xa              = 0
> sync_binlog                    = 0
> -----------------------
> 
> Logs: This most recent case is about 19M uncompressed. I have hosted it
> on my website here:
> http://www.jfpossibilities.com/static/pickup/mysql-crash-20161207.log.gz
> If you would like it delivered some other way let me know. I have about
> 300M of logs from a previous crash situation. That time was particularly
> bad since for some reason it would hang again within ten minutes of
> being restarted. By limiting the connections on our webservers I was
> able to baby it to get it going again.
> 
> I determined "semaphore wait" from the zillions of messages in the logs
> that say it. We shut this one down pretty quickly. But I usually recognize
> the symptoms fairly early on: the number of DB connections escalates
> from ~1k up to the max of 8k. The number of running queries, as shown by
> "show processlist", escalates along with it, going from fewer than 10 to
> hundreds or thousands depending on how long we wait.

Sounds like a max_connections of 100 might be more suitable; that would
match thread_cache_size (see the snippet below).
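
i.e. in the [mysqld] section, something along the lines of:

max_connections   = 100
thread_cache_size = 100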

> Then too the Apache
> connections on the web servers escalate. And of course the "load factor"
> on all the servers jumps up. Icinga usually gives us the heads up.

Is Apache on the same machine? If so, make sure Apache is configured so
that it doesn't consume so much memory that MySQL is left with no
headroom. Of course it's better to have it on a different host.

Make sure Icinga doesn't show swapping.
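
A quick shell-level check, if the Icinga graphs aren't handy:

free -m       # the "Swap: used" figure should stay near 0
vmstat 5 5    # the si/so columns should stay at or near 0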

Check your read and write latency on the SSD.
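
With the sysstat package installed, something like this shows per-device
request latency (the await column) every 5 seconds:

iostat -x 5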

> Obviously your last two requests will have to wait until we are having a
> problem again. But I have gdb installed and ready to go... I think. I
> don't know much about it.


SHOW ENGINE INNODB STATUS would be good too.
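
That can be captured from the command line; the SEMAPHORES section near
the top of the output is the interesting part for this kind of hang, e.g.:

mysql -e "SHOW ENGINE INNODB STATUS\G" > /tmp/innodb-status.txt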

Experiment with gdb on a test machine in the same way beforehand, even if
it's against a different process; an example invocation is below.
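
For example (assuming the Debian package's binary lives at
/usr/sbin/mysqld; adjust the path and output file to suit):

gdb --batch --eval-command="thread apply all bt" /usr/sbin/mysqld $(pidof mysqld) > /tmp/mysqld-stacks.txt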


