← Back to team overview

maria-developers team mailing list archive

Re: 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)

 

Hi,

Few questions:

(1) Is this review for a full patch or just problems on
wsrep_abort_transaction ?
(2) In case at wsrep_abort_transaction we do not have a transaction idea is
that we do not anymore want to enter InnoDB i.e. innobase_kill_query,
that is the reason we set MUST_ABORT to wsrep_conflict_state so that no
more mutexes from InnoDB is acquired.
(3) In wsrep_close_connections code used LOCK_thread_lock, is that enough
to protect THDs on list from concurrent disconnect and delete? (I added
locking
for THD::LOCK_thd_data but I was not sure is it necessary)
(4) Do we still keep manual KILL as TOI ?
(5) What I do with thr_lock.c there we pass victim THD pointer to
wsrep_abort_thd and it is as you pointed out mostly unprotected on that
code,
LOCK TABLE is TOI
(6) Please note that fix will not be ready 19th of October as we again
change critical path, Ramesh will run QA with pquery

On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:

> Hi, Jan!
>
> Here's an idea of the fix:
>
> Let's always use the KILL mutex locking order, that is
>
>   victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
>
> For this we need to fix wsrep_abort_transaction(), which is called from the
> server, and wsrep_innobase_kill_one_trx(), which is called from BF
> thread.
>
> wsrep_abort_transaction() can be fixed by not invoking
> wsrep_innobase_kill_one_trx() and always using KILL code path (that is
> wsrep_thd_awake) and forcing rollback after the kill.
>
> wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data
> at all, just don't lock it. We know that the victim waits on a lock
> inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim
> cannot go away, cannot modify its data, it cannot do anything. So,
> LOCK_thd_data doesn't seem to be necessary at that point.
>
> I've attached a demo patch. It compiles, but I didn't try to run it,
> it's only to show the idea, not a working fix (I already suspect I
> removed too much from wsrep_abort_transaction()). Note it's the patch
> for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
>
> On Oct 12, Jan Lindström wrote:
> > Hi Sergei,
> >
> > Update on wsrep_close_connections problem. My suggestion to fix this
> issue
> > is on
> >
> https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebea3b9d
> >
> > If you have a better solution, please advise.
> >
> > R: Jan
>
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx
>

Follow ups

References