maria-developers team mailing list archive
Mailing list archive
答复: MDEV-520: consider parallel replication patch from taobao patches
I just reposted the patch file in the last setion in our page.
This version shows how I want to change it to a "pure slave"
A dummy_rli is added, and all the remote_slave_xxx variables are not neecssary now.
My plan next as:
1 Change the case insensitivity bug that you have mentioned before.
2 Run mysql-test suits and pass all the testes.
发送时间: 2012年10月4日 23:24
到: Kristian Nielsen; maria-developers@xxxxxxxxxxxxxxxxxxx
Cc: Sergei Golubchik
主题: 答复: MDEV-520: consider parallel replication patch from taobao patches
Thank you very much for looking into the patch and comments. You have got a deep understanding of the patch.
Let's discuss the issues one by one.
1. There indeed is a possible invalid memory access in get_pk_value(). I have changed the definition of st_hash_item::key to char[1024+2*NAME_CHAR_LEN+2], and when building the hash key, if (item->key_len + pack_length >= 1024) break;
This can guarantee that, even if the total length of the primary key is bigger than 1024, at least one key_part of the key can be recorded into the hash_key.(As the max length of one key_part is 1000 in MySQL).
As you mentioned in last mail, this can lead to false positives where potential parallelism is not exploited, but that is still safe and should be ok.
2. The problem of case insensitivity is a bug. I will modify it in next version. Simply we can test the definition in the table schema, and decide whethere change the string to lower case before adding to pk_hash.
3. There is lock protection in dispatch_transaction(). It is pk_hash_lock in the function "check_pk_exist_key"
4. When deadlock occurs, the whole transaction needs to retry. In current implement, a whole transaction is packed into one "Query", so retrying a query is equal to one transaction.
5. We handle the events including Query_event, Table_map_event, Write_row_event, Update_row_event, Delete_row_event, Xid_event. For all other events, in current implement, I just run it locally (that is in Transfer), but not in the remote_slave. This will lose some information when master use statment-based-binlog. If the master use row-based-binlog, I have not found the serious affects in data consistency.
As I told you before, I am changing the patch to make the patched mysqld as a "pure" slave. In the pure slave mode, all other events will be treated like a DDL statement.That means an User_var_event will wait for all the worker queues to be empty, and call ev->apply_event. Is this strategy suitable? Pleast point out if there are potential problems.
About some strategies
1. For simplify , I used some sleep(1000000) in the patch, a condition variable will be better, it will be modified in future, but not the highest priority.
2. Huge transactions may lead to the memory usage and cpu load, thank you for pointing it out, I never think of it before. I think we can deal with it in this way: If a transaction contains too many events, such as bigger than a centain number, we can treat it like a DDL statement. Because it will be executed after all the worker queues to be empty, there is no "order problem" here, so we do not need to construct the pk_hash. Please give me some suggestion for this issue.
3. There are N(default as 16) threads for parallel execution. But the No. 0 thread is special. It is used for the transactions whose affecting rows do not have any primary key or unique key, that means all modification on no-primary-rows will be execute serially.
About uses and plans
I tested it using some simple cases, valgrind used during the cases. The important method for testing is using the productive data. Before a project use it, a necessary step is to build a slave(and a Transfer) in the test environment, data consistency will be checked after running for some days.
In my plan, the next modification is to change it to the "pure slave", that is to run every event using the ev->apply_event directly locally. After it is done, the test-case suit can be used for test.
I will ping you after I upload the new version of the patch.
发件人: Kristian Nielsen [knielsen@xxxxxxxxxxxxxxx]
发送时间: 2012年10月3日 19:26
到: 丁奇; maria-developers@xxxxxxxxxxxxxxxxxxx
Cc: Sergei Golubchik
主题: MDEV-520: consider parallel replication patch from taobao patches
I read through your patch found here:
I am still a bit confused about the status of this patch. It would be useful
if you could explain a bit about the status of the patch:
- How have you tested the patch.
- How are you currently using the patch? Or is it still just
- What are your current plans for the patch? (You mentioned that you have
some changes you are implementing at the moment).
Anyway, I think I have a fair understanding of the overall idea, which seems
very interesting! I will try to summarise here in a bit more detail, as well
as give a few general comments on the patch. Please correct me if I
The basic idea is to build a set (implemented as a hash table) of every unique
key in every row-event in each transaction received from the master. If two
transactions have no overlapping keys, they can be applied in parallel
(foreign_key_checks is set to OFF to make this work).
There is a single thread that processes each event received in sequence. It
checks the key set of the new transaction against the key sets of all
transactions queued for each of N worker threads:
- If there is no conflict with any already queued transaction, the new
transaction can be queued to any worker thread (the one with fewest queued
transaction is chosen).
- If there is a conflict with just one thread, then the transaction can be
queued for that thread - it will then execute in correct sequence with the
- If there is a conflict with multiple threads, wait until the conflicting
transactions have been processed. No new transactions are queued while
Then there are N (default 16) threads that apply the queued
transactions. Queued transactions are processsed in parallel within one
thread, but transaction in different queues are processed in parallel with
If a transaction is received which is not using row events (statement-based
replication or DDL), then we wait until all queued transactions are
processed. Only after that is the new transaction queued, so that such events
are processed correctly (but not in parallel, of course).
Likewise, if the worker threads get too far behind, we wait with queuing any
I think this is a very interesting approach, and I am happy that someone have
actually tried implementing it.
The biggest potential issue seems to be the performance of the thread that
checks the transactions for conflicts and queues them for the appropriate
- Main issue will be huge transactions, if the size of hash tables of key
values become too big for available memory. Maybe this could be solved by
having some limit above which transactions are not tried in parallel.
- There will be a lot of hash lookups of key values, all done in a single
thread. This could become a limit I suppose - testing will be needed to
understand this fully. But I think there will be many workloads where this
will be useful.
Overall, it seems to me this technique is very promising. It seems to be
better at getting maximum parallelism on the slave than any of the other
approaches that have been considered.
As to the patch itself, it seems a bit immature in the present form, but I
understood you are still working on it. Let us discuss what the plans could be
for finishing it and get it ready for inclusion in MariaDB.
The main criteria is that it must be reasonably complete - it must not crash
or silently corrupt data in any usage. However, reasonable restrictions for
what is supported is fine. (For example, we could say that LOAD DATA is not
supported and give an error, but not crash or silently ignore such
statement). But we can discuss this in more details.
Finally, a couple general comments on the code:
1. get_pk_value() does not seem to handle collations, eg. case insensitivity?
It needs to understand that eg. "abab" and "ABAB" can conflict. I think
this should be solved by copying in values with non-binary collations using
the appropriate strxfrm() variant.
2. remote_query() retries the query in case of deadlock - but that seems
wrrong, doesn't it need to retry the whole _transaction_, not just the
3. What about other events like User_var_event ? Will they be included in a
4. There is a my_sleep(10000) in the code when a transaction conflicts with
two or more workers. It would be better to replace this with a condition
variable so we can wake up as soon as the conflict is resolved. Same for
some other sleeps like in wait_for_all_dml_done().
5. Shouldn't there be some locking in dispatch_transaction()? To protect the
hash lookups when checking conflicts against worker threads doing
6. get_pk_value(). There seems to be a buffer overflow - st_hash_item::key is
fixed size, but there seems no check against overrunning it. One solution
is to just truncate if data does not fit - this can lead to false positives
where potential parallelism is not exploited, but that is still safe and
should be ok.
This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you.