maria-discuss team mailing list archive

Re: Backup on the replication server getting affected

To: ragul rangarajan <ragulrangarajan@xxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Fri, 09 Jun 2023 13:55:46 +0200
Cc: MariaDB discuss <maria-discuss@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAExRPvFfev4vn_er26=9QA48PwUhKekt-sYVOEFt1kGAePCNOw@mail.gmail.com> (ragul rangarajan's message of "Mon, 29 May 2023 19:06:15 +0530")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

ragul rangarajan <ragulrangarajan@xxxxxxxxx> writes:

> Hope my issue is more related to the issue MDEV-30780 optimistic parallel
> slave hangs after hit an error
> Trying to reproduce with a minimal database.
>
> Attaching the gbd output

Thanks, that gdb output is really helpful!

I agree with Andrei that this rules out MDEV-30780 as the cause. Instead it
looks to be caused by MDEV-29843, see also MDEV-31427:

  https://jira.mariadb.org/browse/MDEV-29843
  https://jira.mariadb.org/browse/MDEV-31427

This is seen in the stack trace, where all the other worker threads are
waiting on one which is stuck inside pthread_cond_signal:

-----------------------------------------------------------------------
Thread 80 (Thread 0x7f47ad065700 (LWP 25417)):
#0  0x00007f789dca054d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f789dc9e14d in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x000055de401c23cd in inline_mysql_cond_signal (that=0x7f4798006b78) at /home/buildbot/buildbot/build/include/mysql/psi/mysql_thread.h:1099
#3  dec_pending_ops (state=<synthetic pointer>, this=0x7f4798006b30) at /home/buildbot/buildbot/build/sql/sql_class.h:2535
#4  thd_decrement_pending_ops (thd=0x7f47980009b8) at /home/buildbot/buildbot/build/sql/sql_class.cc:5142
#5  0x000055de407b5726 in group_commit_lock::release (this=this@entry=0x55de41f0da80 <write_lock>, num=num@entry=216757233923465)
    at /home/buildbot/buildbot/build/storage/innobase/log/log0sync.cc:388
#6  0x000055de407a0a3c in log_write_up_to (lsn=<optimized out>, lsn@entry=216757233923297, flush_to_disk=flush_to_disk@entry=false, rotate_key=rotate_key@entry=false, 
    callback=<optimized out>, callback@entry=0x7f47ad064090) at /home/buildbot/buildbot/build/storage/innobase/log/log0log.cc:844
-----------------------------------------------------------------------

The pthread_cond_signal() function should normally never block, so this
indicates some corruption of the underlying condition object. This object is
used to asynchronously complete a query on a client connection when using
the thread pool. The MDEV-29843 patch makes worker threads not use this
asynchronous completion, which should eliminate this problem.
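
For a longer dump, the pattern seen in the trace above (frame #0 in
__lll_lock_wait, frame #1 in pthread_cond_signal) can be searched for
mechanically. The following is only a minimal Python sketch; the function
name find_blocked_signalers is hypothetical, and it assumes the text layout
of gdb's "thread apply all bt" output as it appears in the excerpt above.

```python
import re

# Header line that starts each thread in gdb's "thread apply all bt" output.
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# A frame line: "#N  [0xADDR in ]function (args...".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?([^\s(]+)")


def find_blocked_signalers(backtrace_text):
    """Return gdb thread numbers whose frame #0 is __lll_lock_wait
    and whose frame #1 is pthread_cond_signal."""
    frames = {}      # thread number -> {frame number: function name}
    current = None
    for line in backtrace_text.splitlines():
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            frames[current] = {}
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            frames[current][int(m.group(1))] = m.group(2)
    # startswith() tolerates symbol version suffixes such as "@@GLIBC_2.3.2".
    return [
        t for t, f in frames.items()
        if f.get(0, "").startswith("__lll_lock_wait")
        and f.get(1, "").startswith("pthread_cond_signal")
    ]
```

Fed the Thread 80 excerpt above, this returns [80]; threads merely waiting in
pthread_cond_wait are not reported.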

The stack trace strongly indicates MDEV-29843 as the cause. However, the
MDEV-29843 patch is supposed to be in MariaDB 10.6.11, and you wrote:

> Environment: MariaDB 10.6.11

Can you double-check whether you are really seeing this hang in 10.6.11, or
whether it could have been 10.6.10 (the only version that is supposed to be
vulnerable to MDEV-29843)?

Another thing you can check is if you are using
--thread-handling=pool-of-threads, which I think is related to the
MDEV-29843 issue. In MDEV-31427 I suggest
--thread-handling=one-thread-per-connection as a possible work-around.
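
Both checks can be made from any client session with SELECT VERSION() and
SELECT @@thread_handling. As a rough illustration, the Python sketch below
interprets those two values along the lines described in this mail. The
function names are hypothetical, and the version rule encodes only the
assumption stated above (10.6.10 expected to be affected, 10.6.11 expected to
carry the patch); it is not an authoritative list of affected releases.

```python
import re


def parse_mariadb_version(version_string):
    """Extract (major, minor, patch) from e.g. '10.6.11-MariaDB-log'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if m is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    return tuple(int(part) for part in m.groups())


def check_setup(version_string, thread_handling):
    """Summarize the two points to double-check.

    version_string:  result of SELECT VERSION()
    thread_handling: result of SELECT @@thread_handling
    """
    version = parse_mariadb_version(version_string)
    return {
        "version": version,
        # Assumption from this mail: 10.6.10 is the release expected to be
        # vulnerable to MDEV-29843.
        "is_reported_affected_version": version == (10, 6, 10),
        # The thread pool is the setting suspected to be related.
        "uses_thread_pool": thread_handling == "pool-of-threads",
    }
```

If uses_thread_pool is true, the work-around suggested in MDEV-31427 is to
start the server with --thread-handling=one-thread-per-connection.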

Hope this helps,

 - Kristian.

Follow ups

Re: Backup on the replication server getting affected
From: ragul rangarajan, 2023-06-09

References

Backup on the replication server getting affected
From: ragul rangarajan, 2023-05-19
Re: Backup on the replication server getting affected
From: andrei . elkin, 2023-05-22
Re: Backup on the replication server getting affected
From: ragul rangarajan, 2023-05-29