maria-developers team mailing list archive
Message #11436
Re: [Commits] 0f97f6b8398: MDEV-17346 parallel slave start and stop races to workers disappeared
To: kristian.knielsen@xxxxxxxxxxxxxxx
From: andrei.elkin@xxxxxxxxxx
Date: Wed, 03 Oct 2018 15:55:20 +0300
Cc: Michael Widenius <michael.widenius@xxxxxxxxx>, maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <201810031242.w93CgdPx003020@localhost.localdomain> (andrei elkin's message of "Wed, 3 Oct 2018 15:42:39 +0300")
Organization: Home sweet home
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)
Kristian, hello.
Could you please review this worker pool patch? It follows up on the
changes that Monty made in MDEV-9573.
Cheers,
Andrei
> revision-id: 0f97f6b8398054ccb0507fbacc76c9deeddd47a4 (mariadb-10.1.35-71-g0f97f6b8398)
> parent(s): 1fc5a6f30c3a9c047dcf9a36b00026d98f286f6b
> author: Andrei Elkin
> committer: Andrei Elkin
> timestamp: 2018-10-03 15:42:12 +0300
> message:
>
> MDEV-17346 parallel slave start and stop races to workers disappeared
>
> The bug appears as a slave SQL thread hanging in
> rpl_parallel_thread_pool::get_thread() while there are no slave worker
> threads left to wake it up.
>
> The hang could occur when the worker pool was being activated by a
> newly started "new" SQL thread while it was concurrently being
> deactivated by a terminating "old" SQL thread. When reading the
> current pool size, the SQL thread did not take the protection
> introduced by MDEV-9573. The pool cannot be deactivated while there
> is an active slave, but the "new" slave might set its active status
> too late, while still seeing the pool as non-empty. The following
> interleaving of four computational events
>
> Old_slave:any_slave_sql_running() => 0
> New_slave:slave_running= "active"
> New_slave:global_rpl_thread_pool.size > 0 => true
> Old_slave:global_rpl_thread_pool.size := 0
>
> could lead to the observed hang: the new SQL thread proceeds to
> scheduling with all the workers gone.
>
> Fixed by making the SQL thread, at pool activation, first grab the
> same lock that a potential deactivator takes before destroying the
> pool.
>
> ---
> sql/rpl_parallel.cc | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 35cddee6d4d..8fef2d66635 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -1617,13 +1617,32 @@ int rpl_parallel_resize_pool_if_no_slaves(void)
> }
>
>
> +/**
> +  Pool activation is preceded by taking a "lock" via pool_mark_busy()
> +  which guarantees that the number of running slaves drops to zero
> +  atomically with the number of pool workers.
> +  This resolves the race between the function's caller thread and one
> +  that may be attempting to deactivate the pool.
> +*/
>  int
>  rpl_parallel_activate_pool(rpl_parallel_thread_pool *pool)
>  {
> +  int rc= 0;
> +
> +  if ((rc= pool_mark_busy(pool, current_thd)))
> +    return rc; // killed
> +
>    if (!pool->count)
> -    return rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> -                                            0);
> -  return 0;
> +  {
> +    pool_mark_not_busy(pool);
> +    rc= rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> +                                         0);
> +  }
> +  else
> +  {
> +    pool_mark_not_busy(pool);
> +  }
> +  return rc;
>  }
>
>
> _______________________________________________
> commits mailing list
> commits@xxxxxxxxxxx
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
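
For readers following the race description in the quoted commit message,
below is a minimal standalone C++ sketch of the ordering problem and of
why grabbing the busy lock before reading the pool size closes the
window. It is an illustration only: the pool_t struct, the
activate()/deactivate() names and the plain std::mutex are simplified
stand-ins, not the real rpl_parallel interfaces, and the actual patch
still delegates the resize to rpl_parallel_change_thread_count() after
dropping the busy mark.

    // Simplified model of the MDEV-17346 race; all names are hypothetical.
    #include <mutex>

    struct pool_t
    {
      std::mutex busy;   // stands in for pool_mark_busy()/pool_mark_not_busy()
      int count= 0;      // number of worker threads in the pool
    };

    // "Old" terminating SQL thread: deactivation. Dropping the count while
    // holding 'busy' means no activator can observe a stale non-zero size.
    void deactivate(pool_t *pool)
    {
      std::lock_guard<std::mutex> guard(pool->busy);
      pool->count= 0;                       // workers are gone
    }

    // "New" starting SQL thread: activation after the fix. Grabbing the
    // same lock first serializes the size check against deactivate(), so
    // the thread can no longer proceed to scheduling with zero workers.
    int activate(pool_t *pool, int opt_parallel_threads)
    {
      std::lock_guard<std::mutex> guard(pool->busy);
      if (pool->count == 0)
        pool->count= opt_parallel_threads;  // (re)create the workers
      return 0;
    }

Before the fix, activate() read pool->count without taking the lock, so
the four-event interleaving listed in the commit message could leave the
new SQL thread scheduling events into an empty pool.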