maria-developers team mailing list archive

Re: 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)

 

Hi Sergei,

Your suggestion does not work. There is more than one problem:

(1) wsrep_abort_transaction does not release the MDL lock
(2) innobase_kill_one_trx crashes at wsrep->abort_pre_commit() because
the transaction registered inside wsrep has disappeared (this does not
happen if THD::LOCK_thd_data is locked)

==> I will use Seppo's KILL as TOI and remove all unrelated changes from
mdl.cc and wsrep_close_connections; they are not related to the problems
we need to fix.
Let's take one problem at a time. TOI is a very powerful solution to the
problem we are trying to fix.

R: Jan

On Thu, Oct 21, 2021 at 7:52 AM Jan Lindström <jan.lindstrom@xxxxxxxxxxx>
wrote:

> Hi Sergei,
>
> This does not seem to work. Consider the following:
>
> CREATE TABLE t1 (id INT PRIMARY KEY) ENGINE=InnoDB;
> INSERT INTO t1 VALUES (1);
> connection node_2;
> SET AUTOCOMMIT=OFF;
> START TRANSACTION;
> INSERT INTO t1 VALUES (2);
> connection node_2a;
> ALTER TABLE t1 ADD COLUMN f2 INTEGER, LOCK=EXCLUSIVE;
>
> The problem seems to be that the MDL lock held by the thread executing
> the INSERT is not released, even though the thread executing the ALTER
> tries to get it released by killing the holder. We do kill the query
> inside InnoDB. I am not familiar with this MDL code and do not yet
> understand why the MDL lock is not released.
>
> R: Jan
>
> On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
>> Hi, Jan!
>>
>> Here's an idea of the fix:
>>
>> Let's always use the KILL mutex locking order, that is
>>
>>   victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
>>
>> For this we need to fix wsrep_abort_transaction(), which is called from
>> the server, and wsrep_innobase_kill_one_trx(), which is called from BF
>> thread.
>>
>> wsrep_abort_transaction() can be fixed by not invoking
>> wsrep_innobase_kill_one_trx() and always using KILL code path (that is
>> wsrep_thd_awake) and forcing rollback after the kill.
>>
>> wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data
>> at all, just don't lock it. We know that the victim waits on a lock
>> inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim
>> cannot go away, cannot modify its data, it cannot do anything. So,
>> LOCK_thd_data doesn't seem to be necessary at that point.
>>
>> I've attached a demo patch. It compiles, but I didn't try to run it,
>> it's only to show the idea, not a working fix (I already suspect I
>> removed too much from wsrep_abort_transaction()). Note it's the patch
>> for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
>>
>> On Oct 12, Jan Lindström wrote:
>> > Hi Sergei,
>> >
>> > Update on wsrep_close_connections problem. My suggestion to fix this
>> > issue is on
>> >
>> > https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebea3b9d
>> >
>> > If you have a better solution, please advise.
>> >
>> > R: Jan
>>
>> Regards,
>> Sergei
>> VP of MariaDB Server Engineering
>> and security@xxxxxxxxxxx
>>
>
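
For readers following the thread, the fixed locking order proposed in the quoted message (victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex) can be illustrated with a minimal, standalone C++ sketch. This is not the wsrep or InnoDB code: the types Thd, LockSys and Trx and the functions abort_victim() and run_concurrent_aborts() are hypothetical stand-ins invented for illustration. It only shows why two abort paths that take the same mutexes in the same order cannot deadlock against each other; it does not address the MDL release or the disappearing transaction described above.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the server structures named in the thread.
struct Trx     { std::mutex mutex; bool aborted = false; };
struct LockSys { std::mutex mutex; };
struct Thd     { std::mutex LOCK_thd_data; bool killed = false; Trx trx; };

LockSys lock_sys;

// Every path that aborts a victim takes the three mutexes in one fixed order:
//   victim.LOCK_thd_data -> lock_sys.mutex -> victim.trx.mutex
// Guards are released in reverse order when the function returns.
void abort_victim(Thd &victim)
{
  std::lock_guard<std::mutex> thd_guard(victim.LOCK_thd_data);
  std::lock_guard<std::mutex> sys_guard(lock_sys.mutex);
  std::lock_guard<std::mutex> trx_guard(victim.trx.mutex);
  victim.killed = true;
  victim.trx.aborted = true;
}

// Runs two abort paths (think "KILL" and "BF abort") concurrently against
// the same victim and returns how many aborts completed in total.
int run_concurrent_aborts(int rounds)
{
  Thd victim;
  int done_kill = 0;
  int done_bf = 0;

  std::thread kill_path([&] {
    for (int i = 0; i < rounds; i++) { abort_victim(victim); ++done_kill; }
  });
  std::thread bf_path([&] {
    for (int i = 0; i < rounds; i++) { abort_victim(victim); ++done_bf; }
  });

  kill_path.join();
  bf_path.join();
  return done_kill + done_bf;  // each counter was written by one thread only
}
```

If one of the two paths took lock_sys.mutex before LOCK_thd_data instead, each thread could end up holding one mutex while waiting for the other, which is the kind of ordering conflict the proposal aims to rule out.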
