
maria-developers team mailing list archive

Re: Review of MDEV-4506, parallel replication, part 1



Thanks for the comments so far.

I have pushed fixes to 10.0-knielsen, and some specific replies are below.

 - Kristian.

Michael Widenius <monty@xxxxxxxxxxxx> writes:

> + --slave-parallel-threads=# 
> + If non-zero, number of threads to spawn to apply in
> + parallel events on the slave that were group-committed on
> + the master or were logged with GTID in different
> + replication domains.
> What happens if --slave-parallel-threads is 0 ?
> Does this mean that there will be no parallel threads?
> Is this the same as 1 ?
> If same as 1, why not have the option to take values starting from 1 ?

I think it is logical that 0 means disabled and >0 means enabled with that
many threads. But I don't mind using 1 as disabled and needing >=2 for
enabled. Though the ability to use a pool with just 1 thread might be useful
for testing. It doesn't seem critical, maybe we can revisit this if time
allows when the basic stuff is done. 

> === modified file 'mysql-test/suite/perfschema/r/dml_setup_instruments.result'
> --- mysql-test/suite/perfschema/r/dml_setup_instruments.result	2013-03-26 09:35:34 +0000
> +++ mysql-test/suite/perfschema/r/dml_setup_instruments.result	2013-08-29 08:18:22 +0000
> @@ -38,14 +38,14 @@ order by name limit 10;
>  wait/synch/cond/sql/COND_flush_thread_cache	YES	YES
>  wait/synch/cond/sql/COND_manager	YES	YES
> +wait/synch/cond/sql/COND_parallel_entry	YES	YES
> +wait/synch/cond/sql/COND_prepare_ordered	YES	YES
>  wait/synch/cond/sql/COND_queue_state	YES	YES
>  wait/synch/cond/sql/COND_rpl_status	YES	YES
> +wait/synch/cond/sql/COND_rpl_thread	YES	YES
> +wait/synch/cond/sql/COND_rpl_thread_pool	YES	YES
>  wait/synch/cond/sql/COND_server_started	YES	YES
>  wait/synch/cond/sql/COND_thread_cache	YES	YES
> -wait/synch/cond/sql/COND_thread_count	YES	YES
> -wait/synch/cond/sql/Delayed_insert::cond	YES	YES
> -wait/synch/cond/sql/Delayed_insert::cond_client	YES	YES
> -wait/synch/cond/sql/Event_scheduler::COND_state	YES	YES
> Any idea why the above was deleted?
> Don't see anything in your patch that could affect that.

I believe the test does a SELECT with a LIMIT, so when new entries sort before
the limit, entries below it get pushed out of the result.

But if you ask why the test is written that way, I cannot answer. The purpose
of some of those perfschema tests is somewhat of a mystery; they seem to serve
little purpose other than adding maintenance cost for result-file updates
whenever something changes ...

> I didn't find any test results for rpl_parallel.test or rpl_parallel2 in
> your tree. Did you forget to commit these?

There was no time to work on test cases yet. As we discussed I can look into
that as the next thing next week.

> @@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl
>  }
> Could you please document in detail the arguments and return value
> for the following function.

I added the following comment:

  Put a transaction that is ready to commit in the group commit queue.
  The transaction is identified by the ENTRY object passed into this function.

  To facilitate group commit for the binlog, we first queue up ourselves in
  this function. Then later the first thread to enter the queue waits for
  the LOCK_log mutex, and commits for everyone in the queue once it gets the
  lock. Any other threads in the queue just wait for the first one to finish
  the commit and wake them up. This way, all transactions in the queue get
  committed in a single disk operation.

  The return value of this function is TRUE if queued as the first entry in
  the queue (meaning this is the leader), FALSE otherwise.

  The main work in this function is when the commit in one transaction has
  been marked to wait for the commit of another transaction to happen
  first. This is used to support in-order parallel replication, where
  transactions can execute out-of-order but need to be committed in-order with
  how they happened on the master. The waiting of one commit on another needs
  to be integrated with the group commit queue, to ensure that the waiting
  transaction can participate in the same group commit as the waited-for
  transaction.

  So when we put a transaction in the queue, we check if there were other
  transactions already prepared to commit but just waiting for the first one
  to commit. If so, we add those to the queue as well, transitively for all
  such waiters.

MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry)

I also moved some code from the caller into this function, which clarifies
things by keeping related code together. And I added more comments about what
is going on with the different list pointers, and who is doing what.
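The queueing idea described in the comment can be sketched as follows. This is an illustrative simplification, not the actual MariaDB code: the class and member names (GroupCommitQueue, queue_for_group_commit, commit_all) are stand-ins, the queue is simplified to LIFO, and the "single disk operation" is reduced to setting a flag.

```cpp
#include <cassert>
#include <mutex>

struct GroupCommitEntry {
  GroupCommitEntry *next = nullptr;  // intrusive queue link
  bool committed = false;
};

class GroupCommitQueue {
  std::mutex lock_prepare_ordered;   // protects the queue head
  GroupCommitEntry *head = nullptr;
public:
  // Returns true if `entry` became the first element in the queue
  // (the leader), which is then responsible for committing everyone.
  bool queue_for_group_commit(GroupCommitEntry *entry) {
    std::lock_guard<std::mutex> g(lock_prepare_ordered);
    bool leader = (head == nullptr);
    entry->next = head;              // simplified LIFO; real code keeps order
    head = entry;
    return leader;
  }

  // Leader path: commit all queued entries in one "disk operation";
  // returns how many entries were committed together.
  int commit_all() {
    std::lock_guard<std::mutex> g(lock_prepare_ordered);
    int n = 0;
    for (GroupCommitEntry *e = head; e; e = e->next) {
      e->committed = true;           // stand-in for the single fsync'd write
      ++n;
    }
    head = nullptr;
    return n;
  }
};
```

The non-leader entries would, in the real code, block until the leader wakes them; here they simply observe `committed` afterwards.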

>    /*
>      To facilitate group commit for the binlog, we first queue up ourselves in
>      the group commit queue. Then the first thread to enter the queue waits for
>      the LOCK_log mutex, and commits for everyone in the queue once it gets the
>      lock. Any other threads in the queue just wait for the first one to finish
>      the commit and wake them up.
> Do you still use LOCK_log for this?
> It's a bit confusing to mention LOCK_log here when you take the
> LOCK_prepare_ordered in the function.

Yeah, this comment got moved around and is no longer appropriate. I tried to
clarify it (the taking of LOCK_log happens later, after this function).

> I did find it confusing that here you treat next_subsequent_commit as
> a way to avoid recursion, but you use this list for other things in
> register_wait_for_prior_commit().

I added this comment:

    We keep a list of all the waiters that need to be processed in `list`,
    linked through the next_subsequent_commit pointer. Initially this list
    contains only the entry passed into this function.

    We process entries in the list one by one. The element currently being
    processed is pointed to by `cur`, and the element at the end of the list
    is pointed to by `last` (we do not use NULL to terminate the list).

    As we process an element, it is first added to the group_commit_queue.
    Then any waiters for that element are added at the end of the list, to
    be processed in subsequent iterations. This continues until the list
    is exhausted, with all elements ever added eventually processed.

    The end result is a breadth-first traversal of the tree of waiters,
    re-using the next_subsequent_commit pointers in place of extra stack
    space in a recursive traversal.
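The traversal described in the comment can be sketched like this. It is a hedged illustration, not the MariaDB implementation: `Entry`, `subsequent_commits`, and `queue_with_waiters` are hypothetical names, and adding an element to the group commit queue is reduced to counting it.

```cpp
#include <cassert>
#include <vector>

struct Entry {
  Entry *next_subsequent_commit = nullptr;  // reused as the work-list link
  std::vector<Entry*> subsequent_commits;   // transactions waiting on us
};

// Enqueue `first` and, transitively, everything waiting on it.
// Returns the number of entries processed (queued for group commit).
int queue_with_waiters(Entry *first) {
  Entry *cur = first, *last = first;  // live list is [cur .. last];
                                      // we do not NULL-terminate it
  int queued = 0;
  for (;;) {
    ++queued;                         // stand-in for adding cur to the
                                      // group_commit_queue
    // Append cur's waiters at the end, to be processed in later
    // iterations; this is what makes the traversal breadth-first.
    for (Entry *w : cur->subsequent_commits) {
      last->next_subsequent_commit = w;
      last = w;
    }
    if (cur == last)
      break;                          // list exhausted
    cur = cur->next_subsequent_commit;
  }
  return queued;
}
```

Because the work list is threaded through `next_subsequent_commit` itself, no auxiliary queue or recursion stack is needed.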

> +  /* Now we need to clear the wakeup_subsequent_commits_running flags. */
> +  if (list)
> +  {
> +    for (;;)
> +    {
> +      if (list->wakeup_subsequent_commits_running)
> +      {
> +        mysql_mutex_lock(&list->LOCK_wait_commit);
> +        list->wakeup_subsequent_commits_running= false;
> +        mysql_mutex_unlock(&list->LOCK_wait_commit);
> Why do we need the mutex above?
> The only place where we test this variable, as far as I can find, is in
> wait_for_commit::register_wait_for_prior_commit()

Yes, it looks strange. Unfortunately I cannot remember the reason 100% right
now (I wrote this code 9 months ago). I remember that there _was_ a reason. I
should have written a comment to explain this, shame on me!

I think the code basically needs to do the same as
wait_for_commit::wakeup_subsequent_commits2(), but integrated with the tree
traversal of waiters.

I agree this needs clarification and looks suspicious, but I wanted to get the
rest of the comments sent to you today, I will try to re-visit this later (and
the similar code in wakeup_subsequent_commits2()).

> This thread doesn't know if the other thread is just before
> @@ -6597,7 +6756,10 @@ MYSQL_BIN_LOG::write_transaction_to_binl
>      if (next)
>      {
> -      next->thd->signal_wakeup_ready();
> +      if (next->queued_by_other)
> +        next->thd->wait_for_commit_ptr->wakeup();
> What does the above code mean ? (add a comment).

I added this comment:

        Wake up the next thread in the group commit.

        The next thread can be waiting in two different ways, depending on
        whether it put itself in the queue, or if it was put in queue by us
        because it had to wait for us to commit first.

        So execute the appropriate wakeup, identified by the queued_by_other
        flag.

> +  mysql_mutex_unlock(&LOCK_log);
> Here you are going to take LOCK_log and LOCK_prepare_ordered in the
> wrong order.  safe_mutex() should have noticed this.
> In other code you are first taking LOCK_log and then
> LOCK_prepare_ordered.  If someone else is calling
> trx_group_commit_leader() while you are in this code you would get a
> mutex deadlock.

Well spotted, thanks.

I fixed it like this:

    We must not wait for LOCK_log while holding LOCK_prepare_ordered.
    LOCK_log can be held for long periods (eg. we do I/O under it), while
    LOCK_prepare_ordered must only be held for short periods.

    In addition, waiting for LOCK_log while holding LOCK_prepare_ordered would
    violate locking order of LOCK_log-before-LOCK_prepare_ordered. This could
    cause SAFEMUTEX warnings (even if it cannot actually deadlock with current
    code, as there can be at most one group commit leader thread at a time).

    So release and re-acquire LOCK_prepare_ordered if we need to wait for the
    LOCK_log mutex.
  if (!mysql_mutex_trylock(&LOCK_log))
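The pattern behind that fix can be sketched as below. This is an assumption-laden illustration using std::mutex in place of mysql_mutex_t; the function name `acquire_log_lock_safely` and the bool return (true when the non-blocking fast path succeeded) are my own, not from the patch.

```cpp
#include <cassert>
#include <mutex>

std::mutex LOCK_log;              // may be held long (I/O happens under it)
std::mutex LOCK_prepare_ordered;  // must only be held for short periods

// Called with LOCK_prepare_ordered held; returns with both mutexes held,
// never blocking on LOCK_log while the inner lock is held, so the
// LOCK_log-before-LOCK_prepare_ordered ordering is respected.
bool acquire_log_lock_safely() {
  if (LOCK_log.try_lock())
    return true;                  // fast path: got it without waiting,
                                  // so no ordering violation occurred
  LOCK_prepare_ordered.unlock();  // drop the inner lock before blocking
  LOCK_log.lock();                // may wait a long time here
  LOCK_prepare_ordered.lock();    // re-acquire in the documented order
  return false;                   // slow path taken
}
```

The trylock keeps the common uncontended case cheap; only under contention do we pay for the release/re-acquire dance.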

> === modified file 'sql/log_event.h'
> --- sql/log_event.h	2013-06-06 15:51:28 +0000
> +++ sql/log_event.h	2013-08-29 08:18:22 +0000
> @@ -1317,9 +1317,9 @@ class Log_event
>       @see do_apply_event
>     */
> -  int apply_event(Relay_log_info const *rli)
> +  int apply_event(struct rpl_group_info *rgi)
> What is the simple rule to know when to use Relay_log_info and when to
> use rpl_group_info ?

Relay_log_info is for state that must be shared among all events executing
in parallel (e.g. the current position in the relay log).

rpl_group_info is for things that should be private to each event group being
replicated in parallel (such as the THD).
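The split might be pictured roughly as follows. All names here are illustrative stand-ins (the `_sketch` suffix marks them as such), not the actual MariaDB members; the point is only which state is shared and which is per event group.

```cpp
#include <cassert>
#include <string>

struct THD_stub { unsigned long id; };   // stand-in for the server THD

// Shared among all event groups executing in parallel.
struct Relay_log_info_sketch {
  std::string relay_log_name;            // current relay log file
  unsigned long long relay_log_pos;      // shared position, updated in order
};

// Private to each event group replicated in parallel.
struct rpl_group_info_sketch {
  Relay_log_info_sketch *rli;            // back-pointer to the shared state
  THD_stub *thd;                         // worker thread applying this group
  unsigned long long gtid_sub_id;        // per-group ordering key (assumed)
};
```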

> === modified file 'sql/log_event_old.cc'
> --- sql/log_event_old.cc	2013-02-19 10:45:29 +0000
> +++ sql/log_event_old.cc	2013-08-29 08:18:22 +0000
> @@ -67,7 +68,7 @@ Old_rows_log_event::do_apply_event(Old_r
>      do_apply_event(). We still check here to prevent future coding
>      errors.
>    */
> -  DBUG_ASSERT(rli->sql_thd == ev_thd);
> +  DBUG_ASSERT(rgi->thd == ev_thd);
> Should we change all test for rli->sql_thd to rgi->thd ?
> You have at least this test in log_event.c and slave.cc.
> I assume that rli should not have a sql_thd member anymore ?

Yes, this is work in progress, to check all fields in the old Relay_log_info
class and move away those that should use rpl_group_info instead.

And yes, rli->sql_thd should go away. I suspect we might still need a THD
somewhere in the Relay_log_info for the main SQL thread (not sure), but if we
do it is probably best to rename it so that we do not later accidentally merge
code without fixing any references to sql_thd (which would most likely be
wrong).

> +   - We should fail if we connect to the master with opt_slave_parallel_threads
> +     greater than zero and master does not support GTID. Just to avoid a bunch
> +     of potential problems, we won't be able to do any parallel replication
> +     in this case anyway.
> If the master doesn't support GTID, I think we should instead of
> stopping, just issue a warning in the log and continue with normal
> replication.

Agree, that sounds right.

> A much easier way to achieve the wait is to have the caller lock
> mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it
> when it's time for this thread to continue.
> - No need for signaling or checking if the thread got the signal
> - No need to have a cond wait here
> - No need to have a delay_start flag

I consider this method simpler and clearer. But it is not critical code, feel
free to change it if you like.
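For reference, the reviewer's suggested technique can be sketched like this, using std::mutex/std::thread in place of the server primitives; `LOCK_rpl_thread`, `rpl_worker`, and `spawn_with_delayed_start` are illustrative names for this sketch, not the patch's code.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

std::mutex LOCK_rpl_thread;
std::atomic<bool> started{false};

void rpl_worker() {
  // The worker "waits" simply by acquiring the mutex; it blocks here
  // until the creator releases LOCK_rpl_thread. No condition variable,
  // no signal flag, no delay_start member needed.
  std::lock_guard<std::mutex> g(LOCK_rpl_thread);
  started = true;                 // stand-in for the thread's real work
}

void spawn_with_delayed_start() {
  LOCK_rpl_thread.lock();         // take the lock *before* spawning
  std::thread t(rpl_worker);      // worker immediately blocks on the lock
  /* ... finish initializing shared state here ... */
  LOCK_rpl_thread.unlock();       // releasing the lock starts the worker
  t.join();
}
```

Whether this is clearer than an explicit condition-variable handshake is the judgment call discussed above; both are correct.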

> Note that events may be undefined here if thd->killed was set above
> (for the first loop)

Ok, fixed.

> It would be good to check for thd->killed or rpt->stop here and then abort
> the loop.
> Another question: do we really need rpt->stop ?
> Isn't it enough with using the thd->killed for this ?

I think there is a need to generally go through and check for proper handling
of thd->killed, not just here but in all cases. I did not have time to do this
yet, and I am in any case a bit unsure about exactly how kill semantics should
work.

Now that I think about it, shouldn't they remain different?

I think thd->killed is set if user does an explicit KILL of the thread? This
should then abort and fail the currently executing transaction. But it should
not cause the thread to terminate, should it? That could leave us with no
threads at all in the replication thread pool, which would cause replication
to hang. Or should the thread exit and be re-spawned?

This needs more thought, I think ... certainly something is not right here.

>  void Relay_log_info::stmt_done(my_off_t event_master_log_pos,
> -                               time_t event_creation_time, THD *thd)
> +                               time_t event_creation_time, THD *thd,
> +                               struct rpl_group_info *rgi)
> Do we need thd anymore as this is in rgi->thd?
> If they are not the same, please add a comment about this. It's not otherwise
> clear when to use rgi->thd and thd as this function uses both.

This is ongoing work; so far I've just added the rgi parameter to some
functions that needed it. It is probably true that we do not need the THD
argument (on the other hand, it was probably not needed before either, as we
have rli->sql_thd).

> === modified file 'sql/sql_class.cc'
> --- sql/sql_class.cc	2013-06-06 15:51:28 +0000
> +++ sql/sql_class.cc	2013-08-29 08:18:22 +0000
> +void
> +wait_for_commit::wakeup()
> +{
> +  /*
> +    We signal each waiter on their own condition and mutex (rather than using
> +    pthread_cond_broadcast() or something like that).
> +
> +    Otherwise we would need to somehow ensure that they were done
> +    waking up before we could allow this THD to be destroyed, which would
> +    be annoying and unnecessary.
> +  */
> +  mysql_mutex_lock(&LOCK_wait_commit);
> +  waiting_for_commit= false;
> +  mysql_cond_signal(&COND_wait_commit);
> +  mysql_mutex_unlock(&LOCK_wait_commit);
> +}
> In this particular case it looks safe to move the cond signal out of
> the mutex. I don't see how anyone could miss the signal.

Probably. Generally I prefer to write it as above, to not have to make sure
that moving it out is safe.
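To show why no wakeup can be lost with the quoted code, here is a sketch of both sides using std::condition_variable in place of mysql_cond_t; the struct is modeled on the quoted wakeup() but the names and the predicate-based wait() are my illustration, not the actual patch.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

struct wait_for_commit_sketch {
  std::mutex LOCK_wait_commit;
  std::condition_variable COND_wait_commit;
  bool waiting_for_commit = true;

  void wakeup() {
    // Flag cleared and signal sent under the same mutex that guards
    // the wait: the waiter either sees the flag or gets the signal.
    std::lock_guard<std::mutex> g(LOCK_wait_commit);
    waiting_for_commit = false;
    COND_wait_commit.notify_one();
  }

  void wait() {
    std::unique_lock<std::mutex> g(LOCK_wait_commit);
    // The predicate re-check makes the early-wakeup case safe: if
    // wakeup() already ran, we never block at all.
    COND_wait_commit.wait(g, [this] { return !waiting_for_commit; });
  }
};
```

Moving the notify outside the lock is a well-known micro-optimization; signaling under the mutex, as here, trades a possible extra context switch for not having to reason about the reordering at all.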

> +  The waiter needs to lock the waitee to delete itself from the list in
> +  unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can not
> +  hold its own lock while locking waiters, lest we deadlock.
> lest ?

It is valid English, but I've changed it to use the more common wording "as
this could lead to deadlock".

> +  The reason for separate register and wait calls is that this allows to
> +  register the wait early, at a point where the waited-for THD is known to
> +  exist. And then the actual wait can be done much later, where the
> +  waited-for THD may have been long gone. By registering early, the waitee
> +  can signal before disappearing.
> +*/
> Instead of taking locks to ensure that THD will disapper, why not keep
> all THD objects around until all transactions in a group has been
> executed?

I think it is not just about THD disappearing. IIRC, the more important
problem is that the THD may have moved on to do something different. So even
if the data in the other THD is not freed, it may still contain completely
unrelated data (and thus wrong data, for the accessing thread).

But this is actually something I consider very important, and one I spent
quite some effort on; I was quite pleased to get it into its current state. If
arbitrary THDs can have pointers into arbitrary other THDs, the life-time
issues quickly get very nasty, and handling this well matters, so that one
does not need to understand the entire codebase to avoid subtle bugs.
