maria-developers team mailing list archive

Thread
Date

Re: [Commit] MDEV-12746 out-of-order retry

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: andrei.elkin@xxxxxxxxxx
Date: Wed, 14 Feb 2018 18:00:06 +0200
Cc: MariaDB Developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <871shobwrl.fsf@urd.knielsen-hq.org> (Kristian Nielsen's message of "Tue, 13 Feb 2018 18:54:38 +0100")
Organization: Home sweet home
Razorgate-kas: Status: not_detected
Razorgate-kas: Rate: 0
Razorgate-kas: Envelope from:
Razorgate-kas: Version: 5.5.3
Razorgate-kas: LuaCore: 80 2014-11-10_18-01-23 260f8afb9361da3c7edfd3a8e3a4ca908191ad29
Razorgate-kas: Lua profiles 69136 [Nov 12 2014]
Razorgate-kas: Method: none
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Hello.

Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:

> andrei.elkin@xxxxxxxxxx writes:
>
>> First the parent errros out goes to `finish_event_group()' but it's
>> possible it does not have yet the child in its `subsequent_commits_list'
>
>> So I understood so far that the retrying worker needs to check 
>> `stop_on_error_sub_id' that effectively reflects the error status.
>
> Agree, the basic fix looks correct, of checking stop_on_error_sub_id.
>
> And yes, there are probably cases where after stop_on_error_sub_id is set,
> earlier transactions are aborted early without setting up the
> subsequent-transaction-chain, so that wait_for_prior_commit cannot be relied
> upon, and stop_on_error_sub_id must be checked.
>
> So again, nice catch, this would be nasty to try and debug from errors seen
> only in a user's setup.

Thanks!

>
> I am only wondering about two points, if there is something that needs to be
> adjusted in the patch around the error exit after stop_on_error_sub_id, or
> perhaps something else not yet understood going on:
>
>
>
> Number one:
>
>>>> +      if (!rgi->worker_error)
>>>> +        rgi->worker_error= 1;
>>>> +    }
>>>
>>> I would have expected an abort (goto err) here (or well just below, after
>>> unlocking the mutex). Is this not needed, and if so, why not?
>>
>> I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It
>> takes care to check out `worker_error'.
>>
>>   int wait_for_prior_commit(THD *thd)
>>   {
>>     /*
>>       Quick inline check, to avoid function call and locking in the common case
>>       where no wakeup is registered, or a registered wait was already signalled.
>>     */
>>     if (waitee)
>>       return wait_for_prior_commit2(thd);
>>     else
>>     {
>>       if (wakeup_error)
>>         my_error(ER_PRIOR_COMMIT_FAILED, MYF(0));
>>       return wakeup_error;
>>     }
>>   }
>>
>> No waitee is guaranteed as this worker skipped `register_wait_for_prior_event_group_commit()'.
>
> Sorry, I don't get it. wait_for_prior_commit() checks
> thd->wait_for_commit_ptr->wakeup_error, not rgi->worker_error?
> wait_for_prior_commit() doesn't have access to rgi.

Thanks for this check. Indeed, I did not fully recover from earlier
confusion between the two (too much closely named (in some metrics)
identifier they are!).

 thd->wait_for_commit_ptr->wakeup_error := 1

is meant even though the test passes (must be thanks to
rgi->worker_error later checks).


>
>> list. And that's how the parent's `wakeup_error' never reaches the
>> child. This is certainly the case when a child retries after a temp
>> failure. In the test condition the child's `worker_error' is zero, but
>> if a non-zero case `worker_error' gets cleared anyway through
>> `unregister_wait_for_prior_commit()'.
>
> Isn't there some confusion here? Between "wakeup_error", which sits inside
> struct wait_for_commit, and indicates that a prior event group failed. And
> "worker_error" in rpl_group_info, which indicates that _this_ event group
> failed, and is used to remember to do proper error cleanup inside the worker
> thread?

To fix.

>
> Number two:
>
>> You must mean the retrying worker stops doing it optimistically. I saw that.
>> But I saw through logs that at times a worker 
>> could re-try about the number of worker pool size. 
>> Perhaps this unrestrained behaviour was caused by lack of a check of the
>> current patch.
>
> Yes, that is what I mean. And yes, the underlying bug with missing check on
> stop_on_error_sub_id might have caused this as a secondary effect.
>
> My point was just that users should not need to set a high retry count just
> from using parallel replication (remember, several thousand worker threads
> have been used with success on production data). So if this is necessary in
> a test case, it might indicate a problem with the code...

True.
I am making changes to submit a newer version very soon.

Thanks for your help!

Andrei

>
>  - Kristian.

Follow ups

[Fixed Commit] MDEV-12746 out-of-order retry
From: andrei . elkin, 2018-02-14

References

Re: [Commit] MDEV-12746 out-of-order retry
From: Kristian Nielsen, 2018-02-10
Re: [Commit] MDEV-12746 out-of-order retry
From: andrei . elkin, 2018-02-12
Re: [Commit] MDEV-12746 out-of-order retry
From: Kristian Nielsen, 2018-02-13