← Back to team overview

maria-developers team mailing list archive

Re: 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)

 

Hi, Jan!

On Oct 15, Jan Lindström wrote:
> Few questions:
> 
> (1) Is this review for a full patch or just problems on
> wsrep_abort_transaction ?

a full patch

> (2) In case at wsrep_abort_transaction we do not have a transaction idea is
> that we do not anymore want to enter InnoDB i.e. innobase_kill_query,
> that is the reason we set MUST_ABORT to wsrep_conflict_state so that no
> more mutexes from InnoDB is acquired.

right, sorry. I thought I've removed too much from
wsrep_abort_transaction.

> (3) In wsrep_close_connections code used LOCK_thread_lock, is that
> enough to protect THDs on list from concurrent disconnect and delete?
> (I added locking for THD::LOCK_thd_data but I was not sure is it
> necessary)

Isn't it a separate issue?

Anyway, as far as I understand, LOCK_thread_count protects the list of
threads, a THD cannot be added to or deleted from the list.

but a THD can be disconnected or killed and it'll wait somewhere at the
very end of the destructor to remove itself from the list.
So, yes, you need to take LOCK_thd_data or LOCK_thd_kill too.

> (4) Do we still keep manual KILL as TOI ?

Not in my patch, no.

> (5) What I do with thr_lock.c there we pass victim THD pointer to
> wsrep_abort_thd and it is as you pointed out mostly unprotected on
> that code, LOCK TABLE is TOI

It's fine. Same logic as in wsrep_innobase_kill_one_trx().
You kill a victim thd that owns thr locks. This THD cannot just
disappear, it has to release those locks first. And you own the
lock->mutex. So the THD isn't going anywhere until you return from
thr_lock().

> (6) Please note that fix will not be ready 19th of October as we again
> change critical path, Ramesh will run QA with pquery

sure

> On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
> > Hi, Jan!
> >
> > Here's an idea of the fix:
> >
> > Let's always use the KILL mutex locking order, that is
> >
> >   victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
> >
> > For this we need to fix wsrep_abort_transaction(), which is called from the
> > server, and wsrep_innobase_kill_one_trx(), which is called from BF
> > thread.
> >
> > wsrep_abort_transaction() can be fixed by not invoking
> > wsrep_innobase_kill_one_trx() and always using KILL code path (that is
> > wsrep_thd_awake) and forcing rollback after the kill.
> >
> > wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data
> > at all, just don't lock it. We know that the victim waits on a lock
> > inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim
> > cannot go away, cannot modify its data, it cannot do anything. So,
> > LOCK_thd_data doesn't seem to be necessary at that point.
> >
> > I've attached a demo patch. It compiles, but I didn't try to run it,
> > it's only to show the idea, not a working fix (I already suspect I
> > removed too much from wsrep_abort_transaction()). Note it's the patch
> > for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
> >
> > On Oct 12, Jan Lindström wrote:
> > > Hi Sergei,
> > >
> > > Update on wsrep_close_connections problem. My suggestion to fix this
> > issue
> > > is on
> > >
> > https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebea3b9d
> > >
> > > If you have a better solution, please advise.
> > >
> > > R: Jan
> >
Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx


Follow ups

References