← Back to team overview

maria-developers team mailing list archive

Re: 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)

 

Hi Sergei,

I have implemented PlanE as agreed on
branch bb-10.2-MDEV-25114-planE-galera and mostly regression testing looks
promising. However,
I have problems with MDL-locks. For example test case
galera.galera_toi_lock_exclusive hangs and I have not yet found out why. I
will ask
help from Seppo.

R: Jan

On Fri, Oct 15, 2021 at 11:36 AM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:

> Hi, Jan!
>
> On Oct 15, Jan Lindström wrote:
> > Few questions:
> >
> > (1) Is this review for a full patch or just problems on
> > wsrep_abort_transaction ?
>
> a full patch
>
> > (2) In case at wsrep_abort_transaction we do not have a transaction idea
> is
> > that we do not anymore want to enter InnoDB i.e. innobase_kill_query,
> > that is the reason we set MUST_ABORT to wsrep_conflict_state so that no
> > more mutexes from InnoDB is acquired.
>
> right, sorry. I thought I've removed too much from
> wsrep_abort_transaction.
>
> > (3) In wsrep_close_connections code used LOCK_thread_lock, is that
> > enough to protect THDs on list from concurrent disconnect and delete?
> > (I added locking for THD::LOCK_thd_data but I was not sure is it
> > necessary)
>
> Isn't it a separate issue?
>
> Anyway, as far as I understand, LOCK_thread_count protects the list of
> threads, a THD cannot be added to or deleted from the list.
>
> but a THD can be disconnected or killed and it'll wait somewhere at the
> very end of the destructor to remove itself from the list.
> So, yes, you need to take LOCK_thd_data or LOCK_thd_kill too.
>
> > (4) Do we still keep manual KILL as TOI ?
>
> Not in my patch, no.
>
> > (5) What I do with thr_lock.c there we pass victim THD pointer to
> > wsrep_abort_thd and it is as you pointed out mostly unprotected on
> > that code, LOCK TABLE is TOI
>
> It's fine. Same logic as in wsrep_innobase_kill_one_trx().
> You kill a victim thd that owns thr locks. This THD cannot just
> disappear, it has to release those locks first. And you own the
> lock->mutex. So the THD isn't going anywhere until you return from
> thr_lock().
>
> > (6) Please note that fix will not be ready 19th of October as we again
> > change critical path, Ramesh will run QA with pquery
>
> sure
>
> > On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@xxxxxxxxxxx>
> wrote:
> > > Hi, Jan!
> > >
> > > Here's an idea of the fix:
> > >
> > > Let's always use the KILL mutex locking order, that is
> > >
> > >   victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
> > >
> > > For this we need to fix wsrep_abort_transaction(), which is called
> from the
> > > server, and wsrep_innobase_kill_one_trx(), which is called from BF
> > > thread.
> > >
> > > wsrep_abort_transaction() can be fixed by not invoking
> > > wsrep_innobase_kill_one_trx() and always using KILL code path (that is
> > > wsrep_thd_awake) and forcing rollback after the kill.
> > >
> > > wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data
> > > at all, just don't lock it. We know that the victim waits on a lock
> > > inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim
> > > cannot go away, cannot modify its data, it cannot do anything. So,
> > > LOCK_thd_data doesn't seem to be necessary at that point.
> > >
> > > I've attached a demo patch. It compiles, but I didn't try to run it,
> > > it's only to show the idea, not a working fix (I already suspect I
> > > removed too much from wsrep_abort_transaction()). Note it's the patch
> > > for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
> > >
> > > On Oct 12, Jan Lindström wrote:
> > > > Hi Sergei,
> > > >
> > > > Update on wsrep_close_connections problem. My suggestion to fix this
> > > issue
> > > > is on
> > > >
> > >
> https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebea3b9d
> > > >
> > > > If you have a better solution, please advise.
> > > >
> > > > R: Jan
> > >
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx
>

References