maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #06300
Re: Review of MDEV-4506, parallel replication, part 1
Hi!
>>>>> "Kristian" == Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:
Kristian> Michael Widenius <monty@xxxxxxxxxxxx> writes:
>> What should happen if you kill a replication thread is that
>> replication should stop for that master.
<cut>
Kristian> It seems to make sense that KILL CONNECTION on the SQL thread would stop the
Kristian> replication. I do not know whether this is how it works or not in current
Kristian> replication, but whatever it is, we should just leave the behaviour the same.
This is how things are now (as far as I know).
<cut>
Kristian> But KILL of a worker thread should _not_ cause the thread to exit, I
Kristian> think. The pool of parallel threads is just a thread pool, and for simplicity
Kristian> in the first version, it has a fixed size. A KILL will disconnect the thread
Kristian> from the replication connection it was servicing, but the thread must remain
Kristian> alive, ready to serve another connection if needed.
Correct. The KILL command only kills the query or the connection (ie,
THD). The thread is always reused.
Kristian> So the rpt->stop is only about maintaining the thread pool for parallel
Kristian> replication, not about anything to do with aborting any specific master
Kristian> connection replication. In rpl_parallel_change_thread_count() we set rpt->stop
Kristian> to ask all existing threads to exit, and afterwards spawn a new set of
Kristian> threads. This can only happen when all replication SQL/IO threads are stopped
Kristian> and no event execution is taking place.
Yes. The questions is when rpt->stop should take effect.
My suggestion is that the replication threads should only examine
rpt->stop after commits. This way we can ensure that when all threads
are stopped, everything is committed to a certain point and we never
have to do a rollback.
Kristian> So this code of mine is actually wrong, I think:
Kristian> while (!rpt->stop && !thd->killed)
Kristian> It should just be while (!rpt->stop) { ... }.
Kristian> And thd->killed should not be used at all to control thread
Kristian> termination. Instead, it should be used (and is not currently, needs to be
Kristian> fixed) in various places for event execution to allow aborting currently
Kristian> executing event. This is needed around rpt_handle_event() I think, and also
Kristian> around queue_for_group_commit().
Yes. I will be working on this today & tomorrow.
Note that every loop where we wait needs to be stoppable, either with
'stop' or with KILL.
Kristian> For normal stop, this is handled mostly in the SQL thread. It waits with a
Kristian> timeout for the current event group to finish normally, then does a hard kill
Kristian> if needed. This needs to be extended to handle running things in a worker
Kristian> thread, of course. The code is around sql_slave_killed(), I think.
Kristian> Does that make sense?
Makes sence. I will know more tomorrow when I have dug into the code a
bit more.
>>>> + mysql_mutex_lock(&LOCK_wait_commit);
>>>> + waiting_for_commit= false;
>>>> + mysql_cond_signal(&COND_wait_commit);
>>>> + mysql_mutex_unlock(&LOCK_wait_commit);
>>>> +}
>>>>
>>>> In this particular case it looks safe to move the cond signal out of
>>>> the mutex. I don't see how anyone could miss the signal.
>>
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure
Kristian> that moving it out is safe.
>>
>> I agree that this is the way to do in general.
>> However for cases where we are waiting almost once per query, we need
>> to make things faster as this extra wakup will take notable resources!
Kristian> Ok, no problem, I've changed it.
Kristian> I should benchmark this, it keeps coming up. It's not clear to me which way
Kristian> would be the fastest, seems it would depend on the implementation.
I did a benchmark of this a long time ago (+10 years) on Linux and
then using the signal after unlock was notable faster.
The gain is two thread switches + one thread wakeup per cond_signal.
>> Hi!
>>
>> Part 2 of review of parallel replication
>>
>> === added file 'sql/rpl_parallel.cc'
>>
>> <cut>
>>
>> +pthread_handler_t
>> +handle_rpl_parallel_thread(void *arg)
>> +{
>>
>> <cut>
>>
>> + if (end_of_group)
>> + {
>> + in_event_group= false;
>> +
>>
>> Add comment:
>>
>> /*
>> All events for this group has now been executed and logged (but not
>> necessaerly synced).
>> Inform the other event groups that are waiting for this thread that
>> we are done.
>> */
>>
>> + rgi->commit_orderer.unregister_wait_for_prior_commit();
>> + thd->wait_for_commit_ptr= NULL;
>> +
>> + /*
>> + Record that we have finished, so other event groups will no
>> + longer attempt to wait for us to commit.
>> +
>> + We can race here with the next transactions, but that is fine, as
>> + long as we check that we do not decrease last_committed_sub_id. If
>> + this commit is done, then any prior commits will also have been
>> + done and also no longer need waiting for.
>> + */
>> + mysql_mutex_lock(&entry->LOCK_parallel_entry);
>> + if (entry->last_committed_sub_id < event_gtid_sub_id)
>> + {
>> + entry->last_committed_sub_id= event_gtid_sub_id;
>> + mysql_cond_broadcast(&entry->COND_parallel_entry);
>> + }
>> + mysql_mutex_unlock(&entry->LOCK_parallel_entry);
>> +
>>
>> Add:
>>
>> /*
>> Tell storage engines that they can stop waiting.
>> See comments in sql_class.cc::wakeup_subsequent_commits2() for why we can't
>> do this part of the wakeup in unregister_wait_for_prior_commit().
>> */
Kristian> I think there is some misunderstanding here. I added some comments that will
Kristian> hopefully clarify, but the situation is a bit different than how I read this
Kristian> suggestion.
I may have missunderstood some of the code. Then it's even more
important to get a good comment!
Kristian> The situation is that we have for example three threads, A, B, and C. We are
Kristian> thread B at this point in the code. Thread B needed to wait for A to commit
Kristian> first. C needs to wait for B to commit.
Kristian> The call to unregister_wait_for_prior_commit() relates to B waiting for A, it
Kristian> cannot be used in relation to C waiting for B, irrespectively of comments in
Kristian> the function wakeup_subsequent_commits2() ...
ok. I thought that when we called unregister_wait_for_prior_commit()
we had already handled the event including logging.
(Apparently rpt_handle_event() does less than what I expected).
Kristian> The unregister_wait_for_prior_commit() removes B from the list of threads
Kristian> waiting for A. This is usually redundant, because B already removed itself
Kristian> with wait_for_prior_commit() during the commit step. However, if there is for
Kristian> example an error we might not reach commit, and thus we need to remove the
Kristian> dangling entry in A.
ok, this was a bit unclear.
I thought that B could not stop waiting for A until A had written
everything to the binary log.
I don't see how we B can stop waiting for A until A at least has
done all commands and written all it's logs.
Kristian> The wakeup_subsequent_commits() signals C that B is done. Again, this is often
Kristian> redundant, because we already signalled C during B's commit step. However, C
Kristian> could have registered to wait for B between B's commit and this place in the
Kristian> code, so it is needed for correctness.
This part is clear.
<cut>
>> /*
>> The following loops wait until each thread is done it's work for each
>> event group and then stops it when the thread has nothing to do.
>>
>> Note that this may take some time as rpl_paralell:do_event() may request
>> threads at the same time.
>> */
Kristian> I added this comment instead:
Kristian> /*
Kristian> Grab each old thread in turn, and signal it to stop.
Kristian> Note that since we require all replication threads to be stopped before
Kristian> changing the parallel replication worker thread pool, all the threads will
Kristian> be already idle and will terminate immediately.
Kristian> */
Kristian> Hope that clarifies things.
Yes, it does. Thanks.
>> +void
>> +rpl_parallel::wait_for_done()
>> +{
>> + struct rpl_parallel_entry *e;
>> + uint32 i;
>> +
>> + for (i= 0; i < domain_hash.records; ++i)
>> + {
>> + e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i);
>> + mysql_mutex_lock(&e->LOCK_parallel_entry);
>> + while (e->current_sub_id > e->last_committed_sub_id)
>> + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
>> + mysql_mutex_unlock(&e->LOCK_parallel_entry);
>> + }
>> +}
>>
>> How to handle errors in the above case?
>> For example when we get an error and we had to abort things?
Kristian> I think even if we have to abort event execution in a worker thread, we should
Kristian> still increment e->last_committed_sub_id. So the wait here is ok. But we do
Kristian> need code to be able to force all worker threads to stop.
Kristian> An important piece of this puzzle is in sql_slave_killed(). I do not
Kristian> understand the full details of this code yet, but I believe the idea is to
Kristian> wait with some timeout for event groups to finish normally, then hard kill if
Kristian> the timeout triggers (I vaguely remember that for InnoDB we can just kill and
Kristian> rollback, while for MyISAM the timeout is necessary to avoid leaving half an
Kristian> update executed).
We could extend this to check if we are using transactional tables or
not and wait longer if not transaction tables has been used for the query.
Kristian> We need to extend sql_slave_killed() to also handle waiting for each worker
Kristian> thread to stop, with the timeout. Once we have that, it is hopefully clear how
Kristian> to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no
Kristian> longer needed (wait_for_done() is just a temporary thing I put in to get
Kristian> things working for the benchmark).
Kristian> We also need to extend the API around struct wait_for_commit to include error
Kristian> handling. I suppose wait_for_prior_commit() should check for thd->killed when
Kristian> it waits, and be extended to return an error if killed. And then also
Kristian> wakeup_subsequent_commits() should be extended to take an error argument, so
Kristian> that if B is waiting for A, and A gets an error, then A can call
Kristian> wakeup_subsequent_commits() with error=true, which will cause B to wake up but
Kristian> also get an error.
Yes, that sounds right.
Kristian> Then if we have a number of event groups processing in parallel,
A-> B->C->D. And B ends up with an error. Then A can complete, but B will
Kristian> signal an error to C (and then to D), so both B, C, and D will fail.
One problem we have is that if B, C ,D are using non transactional
tables, it would be better to have them complete than abort.
This issue we can leave for now.
Kristian> Then somewhere, perhaps around the end of apply_event_and_update_pos(), we
Kristian> must check an error flag set for B, C, and D, and signal the error back to the
Kristian> main SQL thread so that it can halt and print the error to the error log, as
Kristian> usual.
Kristian> Something like that, I think. Some work to do, but should be possible at least.
Yes. Not too hard, I think.
>> The other way would be to, on error, set last_committed_sub_id back to
>> it's last value.
>> What do you suggest?
>> Assuming setting the value back would be the easiest solution..
Kristian> Yes (if I understood correctly what you meant with "setting back")...
Kristian> I think we should update things the same way whether the transaction commits
Kristian> successfully or fails, so we get the same wakeup of later transactions, but
Kristian> with the extra error flag added to wait_for_commit so that later transactions
Kristian> can see the failure and thus fail and rollback themselves. And then the error
Kristian> can be propagated up the call stack, as normal.
Kristian> Of course, this becomes more complex due to the need to be able to re-try a
Kristian> transaction in case of certain temporary errors like deadlocks and so.
Kristian> Hopefully something can be made to work.
As we should never get a deadlock or temporary error on a slave, this
should not be too hard for a first version.
>> + /*
>> + We are already executing something else in this domain. But the two
>> + event groups were committed together in the same group commit on the
>> + master, so we can still do them in parallel here on the slave.
>> +
>> + However, the commit of this event must wait for the commit of the prior
>> + event, to preserve binlog commit order and visibility across all
>> + servers in the replication hierarchy.
>> + */
>> + rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
>>
>> Here we could add a test if get_thread() is 0, which could happen on shutdown
>> or when killing a replication thread.
Kristian> I do not think we need to do this. I made it so that the worker thread pool
Kristian> can never change size without stopping all replication threads first, so we
Kristian> avoid a lot of these issues. And as I wrote above, killing a worker thread
Kristian> should terminate the event group that it is currently servicing (and the
Kristian> associated SQL thread and master connection), but not remove the thread from
Kristian> the pool.
I was thinking about the case where all worker threads are occupied
for a long time. In this case we will get do_event() stop a long time
until a new thread is free.
I do see some small advantages in beeing able to free the sql-thread
early.
If get_thread() is not a stop point, then we have the problem at terminate,
that do_event() will wait for get_thread() and then schedule a new query()
even if the server should be stopping.
We need to check for the sql-thread being killed before it tells a
worker thread to start with new queries...
Kristian> (I can imagine that in a later version we want to implement a smarter worker
Kristian> thread manager, but for the first version I tried to keep things simple, there
Kristian> are plenty of other problems that we need to tackle...)
>> I think it's relatively safe to use get_thread() as a end-synchronize
>> point as we should never have anything half-done when we call
>> get_thread()). As get_thread() can take a very long time and is not
>> called in many places, it's also sounds like a resonable thing to do.
Kristian> Sorry, I do not understand. "End-synchronize point" for what?
Sorry, wrong word.
What I meant was a point where it's safe to detect that the server or
replication thread has been killed.
>> + /*
>> + Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread.
>> + Same for events not preceeded by GTID (we should not see those normally,
>> + but they might be from an old master).
>> + */
>> + qev->rgi= serial_rgi;
>> + rpt_handle_event(qev, NULL);
>> + delete_or_keep_event_post_apply(rli, typ, qev->ev);
>>
>> Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()?
>> You always call these two together...
Kristian> It is to share code with the non-parallel case.
Kristian> When parallel replication is disabled, everything is done in the SQL thread in
Kristian> slave.cc. Like this:
Kristian> exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL);
Kristian> delete_or_keep_event_post_apply(serial_rgi, typ, ev);
Kristian> Or did you mean that I could move the call of
Kristian> delete_or_keep_event_post_apply() into the function rpt_handle_event()?
Kristian> That might be possible, though it was not 100% clear to me as there are a few
Kristian> differences in how it is done in the different cases.
Yes, I meant moving the function into rpt_handle_event().
Kristian> But in any case that part of the code needs more work. Especially with regard
Kristian> to error handling, and related to updating the last executed event
Kristian> position. Probably it makes sense to revisit this as part of that work.
ok.
Regards,
Monty
References
-
Review of MDEV-4506, parallel replication, part 1
From: Michael Widenius, 2013-09-13
-
Re: Review of MDEV-4506, parallel replication, part 1
From: Kristian Nielsen, 2013-09-13
-
Re: Review of MDEV-4506, parallel replication, part 1
From: Michael Widenius, 2013-09-22
-
Re: Review of MDEV-4506, parallel replication, part 1
From: Kristian Nielsen, 2013-09-23