
maria-developers team mailing list archive

Re: [Commits] 0f97f6b8398: MDEV-17346 parallel slave start and stop races to workers disappeared

 

Hi Andrei!

andrei.elkin@xxxxxxxxxx writes:

> revision-id: 0f97f6b8398054ccb0507fbacc76c9deeddd47a4 (mariadb-10.1.35-71-g0f97f6b8398)
> author: Andrei Elkin
> timestamp: 2018-10-03 15:42:12 +0300
> message:
>
> MDEV-17346  parallel slave start and stop races to workers disappeared

Ooh, that's a nice catch. The code around slave start/stop really is
terribly tricky, glad that you were able to sort this one out.

Patch looks good to push.

What about a test case? I assume this was caught from occasional random
failures in buildbot? That should be ok as a test case for this bug, I think.
It can be a _lot_ of work to write reliable tests for stuff like this, and
the start/stop slave is stressed quite a bit by the existing replication
tests.

> The bug appears as a slave SQL thread hanging in
> rpl_parallel_thread_pool::get_thread() while there are no slave worker
> threads to awake it.
>
> The hang could occur at parallel slave worker pool activation by a
> newly started SQL thread while the pool was concurrently being
> deactivated by a terminating "old" SQL thread. When reading the
> current pool size, the SQL thread did not employ the protection
> introduced by MDEV-9573. The pool can't be deactivated when there is
> an active slave, but the "new" slave might set its active status too
> late while still seeing the pool as non-empty. The end product of
> four computational events
>
>   Old_slave:any_slave_sql_running()  => 0
>   New_slave:slave_running= "active"
>   New_slave:global_rpl_thread_pool.size > 0 => true
>   Old_slave:global_rpl_thread_pool.size := 0
>
> could lead to the observed hang. The new SQL thread proceeds to
> scheduling with all workers gone.
>
> Fixed by making the SQL thread, at pool activation, first grab the
> same lock that a potential deactivator takes prior to destroying
> the pool.
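
To convince myself, I replayed the four-event interleaving above in a tiny
stand-alone model. All names here (SimState, replay_race, ...) are mine for
illustration, not actual server symbols:

```cpp
// Hypothetical replay of the four-event interleaving from the commit
// message; not actual MariaDB code.
#include <cassert>

struct SimState {
  int  pool_size  = 4;     // worker threads currently in the pool
  bool new_active = false; // "new" SQL thread's running flag
  bool old_active = true;  // "old" SQL thread, about to terminate
};

// Replays: Old:any_slave_sql_running() => 0, New:slave_running= "active",
// New:pool.size > 0 => true, Old:pool.size := 0.  Returns true when the
// new slave ends up scheduling onto an empty pool (the hang).
bool replay_race(SimState s) {
  s.old_active = false;                            // old slave stopping
  bool any_running = s.new_active || s.old_active; // => false
  s.new_active = true;                             // activated too late
  bool sees_workers = s.pool_size > 0;             // => true, skips resize
  if (!any_running)
    s.pool_size = 0;                               // old slave empties pool
  return sees_workers && s.pool_size == 0;
}
```

With the ordering in the commit message this returns true; if the new slave
manages to set its flag before the old one's any_slave_sql_running() check,
the pool survives and it returns false.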

At first I did not see how this fixes the problem. But it is because of the
extra check inside rpl_parallel_change_thread_count():

  if (!new_count && !force) {
    if (any_slave_sql_running()) {
      pool_mark_not_busy(pool);
      return 0;                                 // Ok to not resize pool
    }
  }

So one possibility is that New_slave has set running=active when Old_slave
hits that test. Then Old_slave will leave the pool intact thanks to this
test.

The other possibility is that New_slave has not yet set running=active at
this point, so Old_slave can empty the pool. But then Old_slave already
marked the pool busy. So the patch will ensure that New_slave will see that
the pool has been emptied, and will know to re-populate it before starting.
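
In other words, the busy mark serializes activation against deactivation. A
toy sketch of that protocol (the toy_* names are mine, and the real
pool_mark_busy() waits on a condition variable and can fail if the thread is
killed):

```cpp
// Toy model of the busy-mark protocol the patch relies on; not the
// actual MariaDB implementation.
#include <mutex>

struct ToyPool {
  std::mutex lock;
  bool busy  = false;
  int  count = 4;   // number of worker threads
};

void toy_mark_busy(ToyPool &p)     { p.lock.lock(); p.busy = true; }
void toy_mark_not_busy(ToyPool &p) { p.busy = false; p.lock.unlock(); }

// Activation per the patch: grab the busy mark *before* reading count.
// A deactivator must hold the same mark to empty the pool, so the read
// is coherent: either workers still exist, or we see count == 0 and
// repopulate before scheduling anything.
int toy_activate(ToyPool &p, int configured_threads) {
  toy_mark_busy(p);
  if (p.count == 0) {
    toy_mark_not_busy(p);
    p.count = configured_threads; // stands in for change_thread_count()
    return 0;
  }
  toy_mark_not_busy(p);
  return 0;
}
```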

Seems solid.

Thanks for fixing!

 - Kristian.

> ---
>  sql/rpl_parallel.cc | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 35cddee6d4d..8fef2d66635 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -1617,13 +1617,32 @@ int rpl_parallel_resize_pool_if_no_slaves(void)
>  }
>  
>  
> +/**
> +  Pool activation is preceded by taking a "lock" of pool_mark_busy
> +  which guarantees the number of running slaves drops to zero atomically
> +  with the number of pool workers.
> +  This resolves race between the function caller thread and one
> +  that may be attempting to deactivate the pool.
> +*/
>  int
>  rpl_parallel_activate_pool(rpl_parallel_thread_pool *pool)
>  {
> +  int rc= 0;
> +
> +  if ((rc= pool_mark_busy(pool, current_thd)))
> +    return rc;   // killed
> +
>    if (!pool->count)
> -    return rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> -                                            0);
> -  return 0;
> +  {
> +    pool_mark_not_busy(pool);
> +    rc= rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> +                                         0);
> +  }
> +  else
> +  {
> +    pool_mark_not_busy(pool);
> +  }
> +  return rc;
>  }