maria-developers team mailing list archive
Message #11436
Re: [Commits] 0f97f6b8398: MDEV-17346 parallel slave start and stop races to workers disappeared
To: kristian.knielsen@xxxxxxxxxxxxxxx
From: andrei.elkin@xxxxxxxxxx
Date: Wed, 03 Oct 2018 15:55:20 +0300
Cc: Michael Widenius <michael.widenius@xxxxxxxxx>, maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <201810031242.w93CgdPx003020@localhost.localdomain> (andrei elkin's message of "Wed, 3 Oct 2018 15:42:39 +0300")
Organization: Home sweet home
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)
Kristian, hello.
Could you please review this worker pool patch? It follows up on the
changes that Monty made in MDEV-9573.
Cheers,
Andrei
> revision-id: 0f97f6b8398054ccb0507fbacc76c9deeddd47a4 (mariadb-10.1.35-71-g0f97f6b8398)
> parent(s): 1fc5a6f30c3a9c047dcf9a36b00026d98f286f6b
> author: Andrei Elkin
> committer: Andrei Elkin
> timestamp: 2018-10-03 15:42:12 +0300
> message:
>
> MDEV-17346 parallel slave start and stop races to workers disappeared
>
> The bug appears as a slave SQL thread hanging in
> rpl_parallel_thread_pool::get_thread() while there are no slave worker
> threads left to wake it up.
>
> The hang could occur when the worker pool was being activated by a
> newly started "new" SQL thread while it was concurrently being
> deactivated by a terminating "old" SQL thread. When reading the
> current pool size, the SQL thread did not take the protection
> introduced by MDEV-9573. The pool cannot be deactivated while there
> is an active slave, but the "new" slave might set its active status
> too late, while still seeing the pool as non-empty. The following
> interleaving of four computational events
>
> Old_slave:any_slave_sql_running() => 0
> New_slave:slave_running= "active"
> New_slave:global_rpl_thread_pool.size > 0 => true
> Old_slave:global_rpl_thread_pool.size := 0
>
> could lead to the observed hang: the new SQL thread proceeds to
> scheduling with all the workers gone.
>
> Fixed by making the SQL thread, at pool activation, first grab the
> same lock that a potential deactivator takes before destroying the
> pool.
>
> ---
> sql/rpl_parallel.cc | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 35cddee6d4d..8fef2d66635 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -1617,13 +1617,32 @@ int rpl_parallel_resize_pool_if_no_slaves(void)
> }
>
>
> +/**
> +  Pool activation is preceded by taking a "lock" via pool_mark_busy()
> +  which guarantees that the number of running slaves drops to zero
> +  atomically with the number of pool workers.
> +  This resolves the race between the function's caller thread and one
> +  that may be attempting to deactivate the pool.
> +*/
>  int
>  rpl_parallel_activate_pool(rpl_parallel_thread_pool *pool)
>  {
> +  int rc= 0;
> +
> +  if ((rc= pool_mark_busy(pool, current_thd)))
> +    return rc; // killed
> +
>    if (!pool->count)
> -    return rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> -                                            0);
> -  return 0;
> +  {
> +    pool_mark_not_busy(pool);
> +    rc= rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> +                                         0);
> +  }
> +  else
> +  {
> +    pool_mark_not_busy(pool);
> +  }
> +  return rc;
>  }
>
>
> _______________________________________________
> commits mailing list
> commits@xxxxxxxxxxx
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
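
For readers following the race description in the quoted commit message,
below is a minimal standalone C++ sketch of the ordering problem and of
why grabbing the busy lock before reading the pool size closes the
window. It is an illustration only: the pool_t struct, the
activate()/deactivate() names and the plain std::mutex are simplified
stand-ins, not the real rpl_parallel interfaces, and the actual patch
still delegates the resize to rpl_parallel_change_thread_count() after
dropping the busy mark.

    // Simplified model of the MDEV-17346 race; all names are hypothetical.
    #include <mutex>

    struct pool_t
    {
      std::mutex busy;   // stands in for pool_mark_busy()/pool_mark_not_busy()
      int count= 0;      // number of worker threads in the pool
    };

    // "Old" terminating SQL thread: deactivation. Dropping the count while
    // holding 'busy' means no activator can observe a stale non-zero size.
    void deactivate(pool_t *pool)
    {
      std::lock_guard<std::mutex> guard(pool->busy);
      pool->count= 0;                       // workers are gone
    }

    // "New" starting SQL thread: activation after the fix. Grabbing the
    // same lock first serializes the size check against deactivate(), so
    // the thread can no longer proceed to scheduling with zero workers.
    int activate(pool_t *pool, int opt_parallel_threads)
    {
      std::lock_guard<std::mutex> guard(pool->busy);
      if (pool->count == 0)
        pool->count= opt_parallel_threads;  // (re)create the workers
      return 0;
    }

Before the fix, activate() read pool->count without taking the lock, so
the four-event interleaving listed in the commit message could leave the
new SQL thread scheduling events into an empty pool.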