maria-developers team mailing list archive

Thread
Date
Re: Review of MDEV-4506, parallel replication, part 1

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Sun, 22 Sep 2013 17:17:53 +0300
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87ppscc1eb.fsf@frigg.knielsen-hq.org>
Reply-to: monty@xxxxxxxxxxxx
Hi!

>>>>> "Kristian" == Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:

Kristian> Monty,
Kristian> Thanks for the comments so far.

Kristian> I have pushed fixes to 10.0-knielsen, and some specific replies are below.

Thanks. I will review that today.

Kristian>  - Kristian.

Kristian> Michael Widenius <monty@xxxxxxxxxxxx> writes:

>> + --slave-parallel-threads=# 
>> + If non-zero, number of threads to spawn to apply in
>> + parallel events on the slave that were group-committed on
>> + the master or were logged with GTID in different
>> + replication domains.
>> 
>> What happens if --slave-parallel-threads is 0 ?
>> Does this mean that there will be no parallel threads?
>> Is this the same as 1 ?
>> 
>> If same as 1, why not have the option to take values starting from 1 ?

Kristian> I think it is logical that 0 means disabled and >0 means enabled with that
Kristian> many threads. But I don't mind using 1 as disabled and needing >=2 for
Kristian> enabled. Though the ability to use a pool with just 1 thread might be useful
Kristian> for testing. It doesn't seem critical, maybe we can revisit this if time
Kristian> allows when the basic stuff is done. 

I come up with one case where 0 is better
- In multi-source 0 would mean that we have one thread per connection.

You can't get that behaviour if we change the meaning of the
slave-parallel-thread variable.

>> === modified file 'mysql-test/suite/perfschema/r/dml_setup_instruments.result'
>> --- mysql-test/suite/perfschema/r/dml_setup_instruments.result	2013-03-26 09:35:34 +0000
>> +++ mysql-test/suite/perfschema/r/dml_setup_instruments.result	2013-08-29 08:18:22 +0000
>> @@ -38,14 +38,14 @@ order by name limit 10;
>> NAME	ENABLED	TIMED
>> wait/synch/cond/sql/COND_flush_thread_cache	YES	YES
>> wait/synch/cond/sql/COND_manager	YES	YES
>> +wait/synch/cond/sql/COND_parallel_entry	YES	YES
>> +wait/synch/cond/sql/COND_prepare_ordered	YES	YES
>> wait/synch/cond/sql/COND_queue_state	YES	YES
>> wait/synch/cond/sql/COND_rpl_status	YES	YES
>> +wait/synch/cond/sql/COND_rpl_thread	YES	YES
>> +wait/synch/cond/sql/COND_rpl_thread_pool	YES	YES
>> wait/synch/cond/sql/COND_server_started	YES	YES
>> wait/synch/cond/sql/COND_thread_cache	YES	YES
>> -wait/synch/cond/sql/COND_thread_count	YES	YES
>> -wait/synch/cond/sql/Delayed_insert::cond	YES	YES
>> -wait/synch/cond/sql/Delayed_insert::cond_client	YES	YES
>> -wait/synch/cond/sql/Event_scheduler::COND_state	YES	YES
>> 
>> Any idea why the above was deleted?
>> Don't see anything in your patch that could affect that.

Kristian> I believe it does a SELECT with a LIMIT - so when new stuff is added before
Kristian> the limit, stuff below gets pushed out.

Kristian> But if you ask why the test is that way, I cannot answer. The purpose of some
Kristian> of those perfschema tests is somewhat of a mystery, they seem to serve little
Kristian> other purpose than add maintenance cost for result file update whenever
Kristian> something changes ...

ok, I missed the 'limit part'. I can only say 'stupid test' about this...

<cut>

>> @@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl
>> }
>> 
>> Could you please document in detail the arguments and return value
>> for the following function.

Kristian> I added the following comment:

Kristian> /*
Kristian>   Put a transaction that is ready to commit in the group commit queue.
Kristian>   The transaction is identified by the ENTRY object passed into this function.

Kristian>   To facilitate group commit for the binlog, we first queue up ourselves in
Kristian>   this function. Then later the first thread to enter the queue waits for
Kristian>   the LOCK_log mutex, and commits for everyone in the queue once it gets the
Kristian>   lock. Any other threads in the queue just wait for the first one to finish
Kristian>   the commit and wake them up. This way, all transactions in the queue get
Kristian>   committed in a single disk operation.

Kristian>   The return value of this function is TRUE if queued as the first entry in
Kristian>   the queue (meaning this is the leader), FALSE otherwise.

Kristian>   The main work in this function is when the commit in one transaction has
Kristian>   been marked to wait for the commit of another transaction to happen
Kristian>   first. This is used to support in-order parallel replication, where
Kristian>   transactions can execute out-of-order but need to be committed in-order with
Kristian>   how they happened on the master. The waiting of one commit on another needs
Kristian>   to be integrated with the group commit queue, to ensure that the waiting
Kristian>   transaction can participate in the same group commit as the waited-for
Kristian>   transaction.

Kristian>   So when we put a transaction in the queue, we check if there were other
Kristian>   transactions already prepared to commit but just waiting for the first one
Kristian>   to commit. If so, we add those to the queue as well, transitively for all
Kristian>   waiters.
Kristian> */
Kristian> bool
Kristian> MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry)

Kristian> I also moved some code from the caller into this function, which clarifies
Kristian> things by keeping related code together. And I added more comments about what
Kristian> is going on with the different list pointers, and who is doing what.


Thanks. That will help a lot in making thins easier to understand.

<cut>

>> I did find it confusing that here you threat next_subsequent_commit as
>> a way to avoid recursion, but you use this list for other things in
>> register_wait_for_prior_commit().

Kristian> I added this comment:

Kristian>     We keep a list of all the waiters that need to be processed in `list',
Kristian>     linked through the next_subsequent_commit pointer. Initially this list
Kristian>     contains only the entry passed into this function.

Kristian>     We process entries in the list one by one. The element currently being
Kristian>     processed is pointed to by `cur`, and the element at the end of the list
Kristian>     is pointed to by `last` (we do not use NULL to terminate the list).

Kristian>     As we process an element, it is first added to the group_commit_queue.
Kristian>     Then any waiters for that element are added at the end of the list, to
Kristian>     be processed in subsequent iterations. This continues until the list
Kristian>     is exhausted, with all elements ever added eventually processed.

Kristian>     The end result is a breath-first traversal of the tree of waiters,
Kristian>     re-using the next_subsequent_commit pointers in place of extra stack
Kristian>     space in a recursive traversal.

Great. The one thing I am missing was:

The temporary list created in next_subsequent_commit is not
used by the caller or any other function.


>> +  /* Now we need to clear the wakeup_subsequent_commits_running flags. */
>> +  if (list)
>> +  {
>> +    for (;;)
>> +    {
>> +      if (list->wakeup_subsequent_commits_running)
>> +      {
>> +        mysql_mutex_lock(&list->LOCK_wait_commit);
>> +        list->wakeup_subsequent_commits_running= false;
>> +        mysql_mutex_unlock(&list->LOCK_wait_commit);
>> 
>> Why do we need the mutex above?
>> The only place where we test this variable, as far as I can find, is in
>> wait_for_commit::register_wait_for_prior_commit()

Kristian> Yes, it looks strange. Unfortunately I cannot remember the reason 100% right
Kristian> now (I wrote this code 9 months ago). I remember that there _was_ a reason. I
Kristian> should have written a comment to explain this, shame on me!

Kristian> I think the code basically needs to do the same as
Kristian> wait_for_commit::wakeup_subsequent_commits2(), but integrated with the tree
Kristian> traversal.

Kristian> I agree this needs clarification and looks suspicious, but I wanted to get the
Kristian> rest of the comments sent to you today, I will try to re-visit this later (and
Kristian> the similar code in wakeup_subsequent_commits2()).

ok. Please take a look at this when you are ready with other things.
(Add at least a 'todo tag to this code).

<cut>

>> A much easier way to achive the wait is to have the caller lock
>> mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it
>> when it's time for this thread to continue.
>> - No need for signaling or checking if the thread got the signal
>> - No need to have a cond wait here
>> - No need to have a delay_start flag

Kristian> I consider this method simpler and clearer. But it is not critical code, feel
Kristian> free to change it if you like.

I will do this change, probably today.
It will make some of the code a lot simpler. Wait and see...

<cut>

>> Another question: do we really need rpt->stop ?
>> Isn't it enough with using the thd->killed for this ?

Kristian> I think there is a need to generally go through and check for proper handling
Kristian> of thd->killed, not just here but in all cases. I did not have time to do this
Kristian> yet, and I am in any case a bit unsure about exactly how kill semantics should
Kristian> work.

Kristian> Now that I think about it, shouldn't they remain different?

Kristian> I think thd->killed is set if user does an explicit KILL of the thread? This
Kristian> should then abort and fail the currently executing transaction. But it should
Kristian> not cause the thread to terminate, should it? That could leave us with no
Kristian> threads at all in the replication thread pool, which would cause replication
Kristian> to hang. Or should the thread exit and be re-spawned?

What should happen if you kill a replication thread is that
replication should stop for that master.

Kristian> This needs more thought, I think ... certainly something looks not right.

After looking at the full code, I think that the logical way things
should work is:

'stop' is to be used when you want to nicely take done replication.
This means that the current commit groups should be given time to
finish.

thd->killed should mean that we should stop ASAP.
- All not commited things should abort.

This is needed in a 'panic shutdown' (like out soon-out-of- power) or
when trying to kill the replication thread when one notices that
something went horribly wrong (like ALTER TABLE stopping replication).

>> === modified file 'sql/sql_class.cc'
>> --- sql/sql_class.cc	2013-06-06 15:51:28 +0000
>> +++ sql/sql_class.cc	2013-08-29 08:18:22 +0000
>> 
>> +void
>> +wait_for_commit::wakeup()
>> +{
>> +  /*
>> +    We signal each waiter on their own condition and mutex (rather than using
>> +    pthread_cond_broadcast() or something like that).
>> +
>> +    Otherwise we would need to somehow ensure that they were done
>> +    waking up before we could allow this THD to be destroyed, which would
>> +    be annoying and unnecessary.
>> +  */
>> +  mysql_mutex_lock(&LOCK_wait_commit);
>> +  waiting_for_commit= false;
>> +  mysql_cond_signal(&COND_wait_commit);
>> +  mysql_mutex_unlock(&LOCK_wait_commit);
>> +}
>> 
>> In this particular case it looks safe to move the cond signal out of
>> the mutex. I don't see how anyone could miss the signal.

Kristian> Probably. Generally I prefer to write it as above, to not have to make sure
Kristian> that moving it out is safe.

I agree that this is the way to do in general.
However for cases where we are waiting almost once per query, we need
to make things faster as this extra wakup will take notable resources!

>> +  The waiter needs to lock the waitee to delete itself from the list in
>> +  unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can not
>> +  hold its own lock while locking waiters, lest we deadlock.
>> 
>> lest ?

Kristian>     https://en.wiktionary.org/wiki/lest

Kristian> But I've changed it to use a more common English wording "as this could lead
Kristian> to deadlock".

Thanks. It's easier to read the code when you don't need to consult a
dictionary...

>> +  The reason for separate register and wait calls is that this allows to
>> +  register the wait early, at a point where the waited-for THD is known to
>> +  exist. And then the actual wait can be done much later, where the
>> +  waited-for THD may have been long gone. By registering early, the waitee
>> +  can signal before disappearing.
>> +*/
>> 
>> Instead of taking locks to ensure that THD will disapper, why not keep
>> all THD objects around until all transactions in a group has been
>> executed?

Kristian> I think it is not just about THD disappearing. IIRC, the more important
Kristian> problem is that the THD may have moved on to do something different. So even
Kristian> if the data in the other THD is not freed, it may still contain completely
Kristian> unrelated data (and thus wrong data, for the accessing thread).

Kristian> But this is actually something I consider a very important thing, and one that
Kristian> I spent quite some effort on and was quite pleased to get in the current
Kristian> state. If arbitrary THD can have pointers into arbitrary other THD, the
Kristian> life-time issues can quickly get very nasty, and it is important to handle
Kristian> this well to avoid having to understand the entire codebase to avoid subtle bugs.

ok. Thanks for the explanation.

Regards,
Monty
Follow ups

Re: Review of MDEV-4506, parallel replication, part 1
From: Kristian Nielsen, 2013-09-23
References

Review of MDEV-4506, parallel replication, part 1
From: Michael Widenius, 2013-09-13
Re: Review of MDEV-4506, parallel replication, part 1
From: Kristian Nielsen, 2013-09-13