Re: [Commits] 0f97f6b8398: MDEV-17346 parallel slave start and stop races to workers disappeared
Hi Andrei!
andrei.elkin@xxxxxxxxxx writes:
> revision-id: 0f97f6b8398054ccb0507fbacc76c9deeddd47a4 (mariadb-10.1.35-71-g0f97f6b8398)
> author: Andrei Elkin
> timestamp: 2018-10-03 15:42:12 +0300
> message:
>
> MDEV-17346 parallel slave start and stop races to workers disappeared
Ooh, that's a nice catch. The code around slave start/stop really is
terribly tricky, glad that you were able to sort this one out.
Patch looks good to push.
What about a test case? I assume this was caught from occasional random
failures in buildbot? That should be ok as a test case for this bug, I think.
It can be a _lot_ of work to write reliable tests for stuff like this, and
slave start/stop is already stressed quite a bit by the existing replication
tests.
> The bug appears as a slave SQL thread hanging in
> rpl_parallel_thread_pool::get_thread() while there are no slave worker
> threads to wake it up.
>
> The hang could occur when the parallel slave worker pool was being
> activated by a newly started "new" SQL thread while it was concurrently
> being deactivated by a terminating "old" SQL thread. When reading the
> current pool size, the SQL thread did not employ the protection
> introduced by MDEV-9573. The pool can't be deactivated while there is
> an active slave, but the "new" slave might set its active status too
> late while still seeing the pool as non-empty. The end product of the
> following four computational events
>
> Old_slave:any_slave_sql_running() => 0
> New_slave:slave_running= "active"
> New_slave:global_rpl_thread_pool.size > 0 => true
> Old_slave:global_rpl_thread_pool.size := 0
>
> could lead to the observed hang: the new SQL thread proceeds to
> scheduling with all workers gone.
>
> Fixed by making the SQL thread, at pool activation, first grab the
> same lock that a potential deactivator takes before destroying the
> pool.
At first I did not see how this fixes the problem. But it is because of the
extra check inside rpl_parallel_change_thread_count():
  if (!new_count && !force)
  {
    if (any_slave_sql_running())
    {
      pool_mark_not_busy(pool);
      return 0; // Ok to not resize pool
    }
  }
So one possibility is that New_slave has set running=active when Old_slave
hits that test. Then Old_slave will leave the pool intact thanks to this
test.
The other possibility is that New_slave has not yet set running=active at
this point, so Old_slave can empty the pool. But in that case Old_slave has
already marked the pool busy. So the patch ensures that New_slave will see
that the pool has been emptied, and will know to re-populate it before
starting.
Seems solid.
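To double-check the reasoning I also sketched the interleaving as a small
stand-alone toy program (just my illustration, not the server code:
mark_busy()/mark_not_busy(), count and any_slave_running are simplified
stand-ins for pool_mark_busy()/pool_mark_not_busy(), pool->count and
any_slave_sql_running()):

  #include <atomic>
  #include <cassert>
  #include <condition_variable>
  #include <mutex>
  #include <thread>

  struct ToyPool
  {
    std::mutex lock;
    std::condition_variable cond;
    bool busy= false;
    std::atomic<int> count{4};              // worker threads in the pool
  };

  static std::atomic<bool> any_slave_running{false};

  static void mark_busy(ToyPool *p)
  {
    std::unique_lock<std::mutex> g(p->lock);
    p->cond.wait(g, [p] { return !p->busy; });
    p->busy= true;
  }

  static void mark_not_busy(ToyPool *p)
  {
    {
      std::lock_guard<std::mutex> g(p->lock);
      p->busy= false;
    }
    p->cond.notify_all();
  }

  /* "Old" SQL thread stopping: it empties the pool only while holding the
     busy mark, and only after re-checking that no slave is running (the
     check quoted above). */
  static void old_slave_stop(ToyPool *p)
  {
    mark_busy(p);
    if (any_slave_running)
    {
      mark_not_busy(p);
      return;                               // leave the pool intact
    }
    p->count= 0;                            // deactivate: no workers left
    mark_not_busy(p);
  }

  /* "New" SQL thread starting: with the fix it takes the busy mark before
     looking at the pool size, so it either sees the pool intact or sees
     that it was emptied and knows to re-populate it first. */
  static void new_slave_start(ToyPool *p)
  {
    any_slave_running= true;
    mark_busy(p);
    bool need_repopulate= (p->count == 0);
    mark_not_busy(p);
    if (need_repopulate)
      p->count= 4;                          // models re-running change_thread_count()
    assert(p->count > 0);                   // never schedule with zero workers
  }

  int main()
  {
    ToyPool pool;
    std::thread t_old(old_slave_stop, &pool);
    std::thread t_new(new_slave_start, &pool);
    t_old.join();
    t_new.join();
    return 0;
  }

Whichever thread takes the busy mark first, the new slave ends up scheduling
against a non-empty pool, which matches the two cases above.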
Thanks for fixing!
- Kristian.
> ---
> sql/rpl_parallel.cc | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 35cddee6d4d..8fef2d66635 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -1617,13 +1617,32 @@ int rpl_parallel_resize_pool_if_no_slaves(void)
> }
>
>
> +/**
> +  Pool activation is preceded by taking a "lock" of pool_mark_busy
> +  which guarantees the number of running slaves drops to zero atomically
> +  with the number of pool workers.
> +  This resolves the race between the function caller thread and one
> +  that may be attempting to deactivate the pool.
> +*/
>  int
>  rpl_parallel_activate_pool(rpl_parallel_thread_pool *pool)
>  {
> +  int rc= 0;
> +
> +  if ((rc= pool_mark_busy(pool, current_thd)))
> +    return rc; // killed
> +
>    if (!pool->count)
> -    return rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> -                                            0);
> -  return 0;
> +  {
> +    pool_mark_not_busy(pool);
> +    rc= rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
> +                                         0);
> +  }
> +  else
> +  {
> +    pool_mark_not_busy(pool);
> +  }
> +  return rc;
>  }
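
P.S. For anyone reading this later, my rough understanding of the MDEV-9573
busy-mark protocol that the patch builds on is roughly as below. This is a
simplified paraphrase, not the actual implementation: kill/wait-state
bookkeeping is omitted and the member names are from memory.

  /* Simplified paraphrase of the busy-mark helpers (not the real code). */
  static int pool_mark_busy_sketch(rpl_parallel_thread_pool *pool, THD *thd)
  {
    mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
    while (pool->busy)
    {
      if (thd && thd->check_killed())
      {
        mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
        return 1;                           // caller treats this as "killed"
      }
      mysql_cond_wait(&pool->COND_rpl_thread_pool,
                      &pool->LOCK_rpl_thread_pool);
    }
    pool->busy= true;
    mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
    return 0;
  }

  static void pool_mark_not_busy_sketch(rpl_parallel_thread_pool *pool)
  {
    mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
    pool->busy= false;
    mysql_cond_broadcast(&pool->COND_rpl_thread_pool);
    mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
  }

The property that matters for this bug is that while one thread holds the
busy mark, the other cannot read or change the pool size, so the activation
check and the deactivation are serialized.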