maria-developers team mailing list archive

Thread
Date
Re: Review of MDEV-4506, parallel replication, part 1

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Mon, 23 Sep 2013 18:51:47 +0300
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87eh8fr9k5.fsf@frigg.knielsen-hq.org>
Reply-to: monty@xxxxxxxxxxxx
Hi!

>>>>> "Kristian" == Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:

Kristian> Michael Widenius <monty@xxxxxxxxxxxx> writes:
>> What should happen if you kill a replication thread is that
>> replication should stop for that master.

<cut>

Kristian> It seems to make sense that KILL CONNECTION on the SQL thread would stop the
Kristian> replication. I do not know whether this is how it works or not in current
Kristian> replication, but whatever it is, we should just leave the behaviour the same.

This is how things are now (as far as I know).

<cut>

Kristian> But KILL of a worker thread should _not_ cause the thread to exit, I
Kristian> think. The pool of parallel threads is just a thread pool, and for simplicity
Kristian> in the first version, it has a fixed size. A KILL will disconnect the thread
Kristian> from the replication connection it was servicing, but the thread must remain
Kristian> alive, ready to serve another connection if needed.

Correct. The KILL command only kills the query or the connection (ie,
THD). The thread is always reused.

Kristian> So the rpt->stop is only about maintaining the thread pool for parallel
Kristian> replication, not about anything to do with aborting any specific master
Kristian> connection replication. In rpl_parallel_change_thread_count() we set rpt->stop
Kristian> to ask all existing threads to exit, and afterwards spawn a new set of
Kristian> threads. This can only happen when all replication SQL/IO threads are stopped
Kristian> and no event execution is taking place.

Yes. The questions is when rpt->stop should take effect.

My suggestion is that the replication threads should only examine
rpt->stop after commits.  This way we can ensure that when all threads
are stopped, everything is committed to a certain point and we never
have to do a rollback.

Kristian> So this code of mine is actually wrong, I think:

Kristian>   while (!rpt->stop && !thd->killed)

Kristian> It should just be while (!rpt->stop) { ... }.
Kristian> And thd->killed should not be used at all to control thread
Kristian> termination. Instead, it should be used (and is not currently, needs to be
Kristian> fixed) in various places for event execution to allow aborting currently
Kristian> executing event. This is needed around rpt_handle_event() I think, and also
Kristian> around queue_for_group_commit().

Yes.  I will be working on this today & tomorrow.

Note that every loop where we wait needs to be stoppable, either with
'stop' or with KILL.

Kristian> For normal stop, this is handled mostly in the SQL thread. It waits with a
Kristian> timeout for the current event group to finish normally, then does a hard kill
Kristian> if needed. This needs to be extended to handle running things in a worker
Kristian> thread, of course. The code is around sql_slave_killed(), I think.

Kristian> Does that make sense?

Makes sence. I will know more tomorrow when I have dug into the code a
bit more.

>>>> +  mysql_mutex_lock(&LOCK_wait_commit);
>>>> +  waiting_for_commit= false;
>>>> +  mysql_cond_signal(&COND_wait_commit);
>>>> +  mysql_mutex_unlock(&LOCK_wait_commit);
>>>> +}
>>>> 
>>>> In this particular case it looks safe to move the cond signal out of
>>>> the mutex. I don't see how anyone could miss the signal.
>> 
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure
Kristian> that moving it out is safe.
>> 
>> I agree that this is the way to do in general.
>> However for cases where we are waiting almost once per query, we need
>> to make things faster as this extra wakup will take notable resources!

Kristian> Ok, no problem, I've changed it.

Kristian> I should benchmark this, it keeps coming up. It's not clear to me which way
Kristian> would be the fastest, seems it would depend on the implementation.

I did a benchmark of this a long time ago (+10 years) on Linux and
then using the signal after unlock was notable faster.

The gain is two thread switches + one thread wakeup per cond_signal.



>> Hi!
>> 
>> Part 2 of review of parallel replication
>> 
>> === added file 'sql/rpl_parallel.cc'
>> 
>> <cut>
>> 
>> +pthread_handler_t
>> +handle_rpl_parallel_thread(void *arg)
>> +{
>> 
>> <cut>
>> 
>> +      if (end_of_group)
>> +      {
>> +        in_event_group= false;
>> +
>> 
>> Add comment:
>> 
>> /*
>> All events for this group has now been executed and logged (but not
>> necessaerly synced).
>> Inform the other event groups that are waiting for this thread that
>> we are done.
>> */
>> 
>> +        rgi->commit_orderer.unregister_wait_for_prior_commit();
>> +        thd->wait_for_commit_ptr= NULL;
>> +
>> +        /*
>> +          Record that we have finished, so other event groups will no
>> +          longer attempt to wait for us to commit.
>> +
>> +          We can race here with the next transactions, but that is fine, as
>> +          long as we check that we do not decrease last_committed_sub_id. If
>> +          this commit is done, then any prior commits will also have been
>> +          done and also no longer need waiting for.
>> +        */
>> +        mysql_mutex_lock(&entry->LOCK_parallel_entry);
>> +        if (entry->last_committed_sub_id < event_gtid_sub_id)
>> +        {
>> +          entry->last_committed_sub_id= event_gtid_sub_id;
>> +          mysql_cond_broadcast(&entry->COND_parallel_entry);
>> +        }
>> +        mysql_mutex_unlock(&entry->LOCK_parallel_entry);
>> +
>> 
>> Add:
>> 

>> /*
>> Tell storage engines that they can stop waiting.
>> See comments in sql_class.cc::wakeup_subsequent_commits2() for why we can't
>> do this part of the wakeup in unregister_wait_for_prior_commit().
>> */

Kristian> I think there is some misunderstanding here. I added some comments that will
Kristian> hopefully clarify, but the situation is a bit different than how I read this
Kristian> suggestion.

I may have missunderstood some of the code. Then it's even more
important to get a good comment!

Kristian> The situation is that we have for example three threads, A, B, and C. We are
Kristian> thread B at this point in the code. Thread B needed to wait for A to commit
Kristian> first. C needs to wait for B to commit.

Kristian> The call to unregister_wait_for_prior_commit() relates to B waiting for A, it
Kristian> cannot be used in relation to C waiting for B, irrespectively of comments in
Kristian> the function wakeup_subsequent_commits2() ...

ok. I thought that when we called unregister_wait_for_prior_commit()
we had already handled the event including logging.
(Apparently rpt_handle_event() does less than what I expected).

Kristian> The unregister_wait_for_prior_commit() removes B from the list of threads
Kristian> waiting for A. This is usually redundant, because B already removed itself
Kristian> with wait_for_prior_commit() during the commit step. However, if there is for
Kristian> example an error we might not reach commit, and thus we need to remove the
Kristian> dangling entry in A.

ok, this was a bit unclear.
I thought that B could not stop waiting for A until A had written
everything to the binary log.

I don't see how we B can stop waiting for A until A at least has
done all commands and written all it's logs.

Kristian> The wakeup_subsequent_commits() signals C that B is done. Again, this is often
Kristian> redundant, because we already signalled C during B's commit step. However, C
Kristian> could have registered to wait for B between B's commit and this place in the
Kristian> code, so it is needed for correctness.

This part is clear.

<cut>

>> /*
>> The following loops wait until each thread is done it's work for each
>> event group and then stops it when the thread has nothing to do.
>> 
>> Note that this may take some time as rpl_paralell:do_event() may request
>> threads at the same time.
>> */

Kristian> I added this comment instead:

Kristian>   /*
Kristian>     Grab each old thread in turn, and signal it to stop.

Kristian>     Note that since we require all replication threads to be stopped before
Kristian>     changing the parallel replication worker thread pool, all the threads will
Kristian>     be already idle and will terminate immediately.
Kristian>   */

Kristian> Hope that clarifies things.

Yes, it does. Thanks.

>> +void
>> +rpl_parallel::wait_for_done()
>> +{
>> +  struct rpl_parallel_entry *e;
>> +  uint32 i;
>> +
>> +  for (i= 0; i < domain_hash.records; ++i)
>> +  {
>> +    e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i);
>> +    mysql_mutex_lock(&e->LOCK_parallel_entry);
>> +    while (e->current_sub_id > e->last_committed_sub_id)
>> +      mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
>> +    mysql_mutex_unlock(&e->LOCK_parallel_entry);
>> +  }
>> +}
>> 
>> How to handle errors in the above case?
>> For example when we get an error and we had to abort things?

Kristian> I think even if we have to abort event execution in a worker thread, we should
Kristian> still increment e->last_committed_sub_id. So the wait here is ok. But we do
Kristian> need code to be able to force all worker threads to stop.

Kristian> An important piece of this puzzle is in sql_slave_killed(). I do not
Kristian> understand the full details of this code yet, but I believe the idea is to
Kristian> wait with some timeout for event groups to finish normally, then hard kill if
Kristian> the timeout triggers (I vaguely remember that for InnoDB we can just kill and
Kristian> rollback, while for MyISAM the timeout is necessary to avoid leaving half an
Kristian> update executed).

We could extend this to check if we are using transactional tables or
not and wait longer if not transaction tables has been used for the query.

Kristian> We need to extend sql_slave_killed() to also handle waiting for each worker
Kristian> thread to stop, with the timeout. Once we have that, it is hopefully clear how
Kristian> to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no
Kristian> longer needed (wait_for_done() is just a temporary thing I put in to get
Kristian> things working for the benchmark).

Kristian> We also need to extend the API around struct wait_for_commit to include error
Kristian> handling. I suppose wait_for_prior_commit() should check for thd->killed when
Kristian> it waits, and be extended to return an error if killed. And then also
Kristian> wakeup_subsequent_commits() should be extended to take an error argument, so
Kristian> that if B is waiting for A, and A gets an error, then A can call
Kristian> wakeup_subsequent_commits() with error=true, which will cause B to wake up but
Kristian> also get an error.

Yes, that sounds right.

Kristian> Then if we have a number of event groups processing in parallel,
A-> B->C->D. And B ends up with an error. Then A can complete, but B will
Kristian> signal an error to C (and then to D), so both B, C, and D will fail.

One problem we have is that if B, C ,D are using non transactional
tables, it would be better to have them complete than abort.

This issue we can leave for now.

Kristian> Then somewhere, perhaps around the end of apply_event_and_update_pos(), we
Kristian> must check an error flag set for B, C, and D, and signal the error back to the
Kristian> main SQL thread so that it can halt and print the error to the error log, as
Kristian> usual.

Kristian> Something like that, I think. Some work to do, but should be possible at least.

Yes. Not too hard, I think.

>> The other way would be to, on error, set last_committed_sub_id back to
>> it's last value.

>> What do you suggest?
>> Assuming setting the value back would be the easiest solution..

Kristian> Yes (if I understood correctly what you meant with "setting back")...
Kristian> I think we should update things the same way whether the transaction commits
Kristian> successfully or fails, so we get the same wakeup of later transactions, but
Kristian> with the extra error flag added to wait_for_commit so that later transactions
Kristian> can see the failure and thus fail and rollback themselves. And then the error
Kristian> can be propagated up the call stack, as normal.

Kristian> Of course, this becomes more complex due to the need to be able to re-try a
Kristian> transaction in case of certain temporary errors like deadlocks and so.
Kristian> Hopefully something can be made to work.

As we should never get a deadlock or temporary error on a slave, this
should not be too hard for a first version.

>> +      /*
>> +        We are already executing something else in this domain. But the two
>> +        event groups were committed together in the same group commit on the
>> +        master, so we can still do them in parallel here on the slave.
>> +
>> +        However, the commit of this event must wait for the commit of the prior
>> +        event, to preserve binlog commit order and visibility across all
>> +        servers in the replication hierarchy.
>> +      */
>> +      rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
>> 
>> Here we could add a test if get_thread() is 0, which could happen on shutdown
>> or when killing a replication thread.

Kristian> I do not think we need to do this. I made it so that the worker thread pool
Kristian> can never change size without stopping all replication threads first, so we
Kristian> avoid a lot of these issues. And as I wrote above, killing a worker thread
Kristian> should terminate the event group that it is currently servicing (and the
Kristian> associated SQL thread and master connection), but not remove the thread from
Kristian> the pool.

I was thinking about the case where all worker threads are occupied
for a long time.  In this case we will get do_event() stop a long time
until a new thread is free.

I do see some small advantages in beeing able to free the sql-thread
early.

If get_thread() is not a stop point, then we have the problem at terminate,
that do_event() will wait for get_thread() and then schedule a new query()
even if the server should be stopping.

We need to check for the sql-thread being killed before it tells a
worker thread to start with new queries...

Kristian> (I can imagine that in a later version we want to implement a smarter worker
Kristian> thread manager, but for the first version I tried to keep things simple, there
Kristian> are plenty of other problems that we need to tackle...)

>> I think it's relatively safe to use get_thread() as a end-synchronize
>> point as we should never have anything half-done when we call
>> get_thread()).  As get_thread() can take a very long time and is not
>> called in many places, it's also sounds like a resonable thing to do.

Kristian> Sorry, I do not understand. "End-synchronize point" for what?

Sorry, wrong word.

What I meant was a point where it's safe to detect that the server or
replication thread has been killed.

>> +    /*
>> +      Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread.
>> +      Same for events not preceeded by GTID (we should not see those normally,
>> +      but they might be from an old master).
>> +    */
>> +    qev->rgi= serial_rgi;
>> +    rpt_handle_event(qev, NULL);
>> +    delete_or_keep_event_post_apply(rli, typ, qev->ev);
>> 
>> Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()?
>> You always call these two together...

Kristian> It is to share code with the non-parallel case.
Kristian> When parallel replication is disabled, everything is done in the SQL thread in
Kristian> slave.cc. Like this:

Kristian>     exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL);
Kristian>     delete_or_keep_event_post_apply(serial_rgi, typ, ev);

Kristian> Or did you mean that I could move the call of
Kristian> delete_or_keep_event_post_apply() into the function rpt_handle_event()?
Kristian> That might be possible, though it was not 100% clear to me as there are a few
Kristian> differences in how it is done in the different cases.

Yes, I meant moving the function into rpt_handle_event().

Kristian> But in any case that part of the code needs more work. Especially with regard
Kristian> to error handling, and related to updating the last executed event
Kristian> position. Probably it makes sense to revisit this as part of that work.

ok.

Regards,
Monty
References

Review of MDEV-4506, parallel replication, part 1
From: Michael Widenius, 2013-09-13
Re: Review of MDEV-4506, parallel replication, part 1
From: Kristian Nielsen, 2013-09-13
Re: Review of MDEV-4506, parallel replication, part 1
From: Michael Widenius, 2013-09-22
Re: Review of MDEV-4506, parallel replication, part 1
From: Kristian Nielsen, 2013-09-23