Re: More suggestions for changing option names for optimistic parallel replication

 

Pavel Ivanov <pivanof@xxxxxxxxxx> writes:

> So the slave coordinator (or I don't remember what you call it) reads
> the relay log ahead of the last executing transaction? I.e. it will read
> and assign to threads T1.1, T1.2, then it will read T1.3, detect that
> there are no threads available for execution, but according to what
> you said it will still put this in the queue for thread 1, right? How
> long can this queuing be? Does it keep all queued events in memory?
> Does it depend on the size of the transactions (i.e. how much memory
> can this queuing consume)?

Right. The queueing is limited by the configuration variable
--slave-parallel-max-queued, which defaults to 128KB per worker thread. It
does not depend on the size of the transactions (it is possible to replicate a
large transaction without keeping all of it in memory at once). It does need
to keep at least one event in memory per worker thread, of course, even if an
individual event exceeds --slave-parallel-max-queued.

So memory consumption for queued events is generally limited by
@@slave_parallel_threads * @@slave_parallel_max_queued.
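
As a rough illustration (the numbers below are just examples, not tuning
advice), the bound can be computed directly from the two variables:

    -- Sketch only; values are illustrative, not recommendations.
    -- (Changing slave_parallel_threads requires the slave to be stopped.)
    SET GLOBAL slave_parallel_threads = 10;
    SET GLOBAL slave_parallel_max_queued = 131072;  -- 128KB, the default

    -- Approximate upper bound on memory used for queued events:
    -- 10 * 128KB = ~1.25MB, plus at least one event per worker thread
    -- if a single event is larger than the limit.
    SELECT @@slave_parallel_threads * @@slave_parallel_max_queued
           AS max_queued_bytes;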

>>     M1 --\           /---S2 ---S3
>>           +-- S1 ---+
>>     M2 --/

>> I think that such a feature, which can break replication unless the user
>> carefully designs the application to avoid it, requires a switch to turn it on
>> or off.
>
> Could there really be cases when multi-domain parallel application of
> transactions is safe on S1, but not safe on S2 or S3?

Such cases can definitely be constructed. For example, suppose S2 is
stopped. The user runs a long ALTER TABLE t1 on M1, carefully waits for that
ALTER to complete on M1 and on S1, and then starts doing DML against table t1
on M2.

Then, when S2 is restarted, it seems likely that it will start executing the
DML from M2's domain before the ALTER from M1's domain has completed, which
can break replication.
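
A sketch of the problematic sequence (the domain ids, table and column names
here are made up purely for illustration):

    -- On M1, assumed to replicate as domain 1:
    SET SESSION gtid_domain_id = 1;
    ALTER TABLE t1 ADD COLUMN c INT;          -- long-running ALTER

    -- The user waits until this ALTER has been applied on M1 and on S1
    -- (for example with MASTER_GTID_WAIT() on S1), and only then:

    -- On M2, assumed to replicate as domain 2:
    SET SESSION gtid_domain_id = 2;
    UPDATE t1 SET c = 1;                      -- DML against the altered table

    -- S2 was stopped during all of this. When it is restarted, it receives
    -- both domains from S1 and may apply them in parallel, so the UPDATE
    -- can run before the ALTER has completed on S2, breaking replication.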

I do agree that in practice, something that breaks domain-based parallel
replication on S2 and S3 is likely to be able to cause problems on S1 as well.

On the other hand, there does not seem to be much harm in providing a switch
to turn domain-based parallel replication on or off. (Such a mechanism has to
be implemented anyway, since it can only be used in GTID mode; in non-GTID
mode domain-based parallel replication is always off.)
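
So the switch would only be meaningful on a connection configured for GTID
anyway; something like this (the on/off variable name below is purely
hypothetical, nothing is decided here):

    -- Domain-based parallel replication requires GTID mode on the slave:
    CHANGE MASTER TO master_use_gtid = slave_pos;

    -- Hypothetical switch, name invented for illustration only:
    -- SET GLOBAL slave_domain_parallel = OFF;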

>> What I really need is to get some results from testing optimistic parallel
>> replication, to understand how many retries will be needed in various
>> scenarios, and if those retries are a bottleneck for performance.
>
> Then I'd suggest to not add any special processing of such a use case,
> but add something that will allow easy monitoring of what happens. E.g.
> some status variables which could be plotted over time and show (or at
> least hint at) whether this is a significant bottleneck for performance
> or not. This could be something like total time (in both wall time and
> accumulated CPU time) spent executing transactions in parallel, time
> spent rolling back transactions due to this lock conflict, time spent
> rolling back transactions because of other reasons (e.g. due to STOP
> SLAVE or reconnect after master crash), maybe also time spent waiting
> in one parallel thread while a transaction is executing in another
> thread, etc.

Yes, I agree, we need more of this. I think the monitoring part of the feature
is currently rather weak; it probably suffers from the fact that it has been a
long time since I did operations work myself. Hopefully this can be
significantly improved in the near future.

I wonder if such accumulated-time measurements can be added liberally without
significantly affecting performance?
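
Something that can already be plotted today is the retry counter; the
time-based counters suggested above would be new (the name in the commented
line below is invented, just to show the kind of thing that could be graphed):

    -- Existing status variable: how many times the slave retried a
    -- transaction (can already be graphed over time).
    SHOW GLOBAL STATUS LIKE 'Slave_retried_transactions';

    -- The time-based counters do not exist yet; something like the
    -- following is purely hypothetical:
    -- SHOW GLOBAL STATUS LIKE 'Slave_parallel_%_time';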

 - Kristian.

