
maria-developers team mailing list archive

Re: MariaDB Galera replication

 

On Fri, Nov 15, 2013 at 5:55 PM, Alex Yurchenko
<alexey.yurchenko@xxxxxxxxxxxxx> wrote:
>>>> To be honest I never looked at how Galera works before. I've looked at
>>>> it now and I don't see how it can fit with us. The major disadvantages
>>>> I immediately see:
>>>> 1. Synchronous replication. That means the client must wait while the
>>>> transaction is applied on all nodes, which adds unacceptably high latency
>>>> to each transaction. And what if there's a network blip and some node
>>>> becomes inaccessible? Will all writes just freeze? I see the statement
>>>> that "failed nodes automatically excluded from the cluster", but to do
>>>> that the cluster must wait for some timeout in case it's indeed a network
>>>> blip and the node will "quickly" reconnect. And every client must wait for
>>>> the cluster to decide what happened with that one node.
>>>> 2. Let's say a node fell out of the cluster for 5 minutes and then
>>>> reconnected. I guess it will be treated as a "new node": it will
>>>> trigger a state transfer and start downloading the whole
>>>> database? And while it's trying to download, say, 500GB of data files,
>>>> all other nodes (or maybe just the donor?) won't be able to change those
>>>> files locally and thus will blow up their memory consumption. That means
>>>> they could quickly run out of memory and the "new node" won't be able to
>>>> finish its "initialization"...
>>>> 3. It looks like there's a strong asymmetry in starting cluster nodes --
>>>> the first one should be started with an empty wsrep_cluster_address and
>>>> all others should be started with the address of the first node. So I
>>>> can't start all nodes uniformly and then issue some commands to
>>>> connect them to each other. That's bad.
>>>> 4. What's the transition path? How do I upgrade MySQL/MariaDB
>>>> replicating via regular replication to Galera? It looks like there's
>>>> no such path, and the solution is to stop the world on regular
>>>> replication and restart it with Galera. Sorry, I can't do that with
>>>> our production systems.
>>>>
>>>> I believe these problems are severe enough for us that we can't
>>>> work with Galera.
>>>
>>>
>>>
>>> Pavel, you seem to be terribly mistaken on almost all accounts:
>>>
>>> 1. *Replication* (i.e. data buffer copying) is indeed synchronous. But
>>> nobody said that commit is. What Galera does is very similar to
>>> semi-sync, except that it does it technically better. I would not dare
>>> to suggest Galera replication if I didn't believe it to be superior to
>>> semi-sync in every respect.
>>
>>
>> Well, apparently we have a different understanding of what the term
>> "synchronous replication" means. This term is all over the Galera docs,
>> but I didn't find a detailed description of how Galera replication
>> actually works. So I assumed that my understanding of the term
>> (which actually seems to be in line with the definitions at
>> http://en.wikipedia.org/wiki/Replication_(computing) ) is what was
>> implied there. So I hope you'll be able to describe in detail how
>> Galera replication works.
>
>
> There can be much detail ;) I'll start with this:
>
> 1) During transaction execution Galera records the unique keys of the rows
> modified or referenced (via foreign keys) by the transaction.
> 2) At prepare time it takes the keys and binlog events from the thread IO
> cache and wraps them into a "writeset".
> 3) The writeset is synchronously copied to all nodes. This is the only
> synchronous operation and can be done either over TCP or multicast UDP. All
> nodes, including the sender, receive writesets in exactly the same order,
> which defines the sequence number part of the GTID. The writeset is placed
> in the receive queue for further processing.
> 4) The writeset is picked from the queue and (in seqno order) is passed
> through the certification algorithm, which determines whether the writeset
> can be applied or not, and also which writesets it can be applied in
> parallel with.
> 5) If the certification verdict is positive, the master commits the
> transaction and sends OK to the client; slaves apply and commit the binlog
> events from the writeset.
> 6) If the certification verdict is negative, the master rolls back the
> transaction and sends a deadlock error to the client; slaves just discard
> the writeset.
>
> In the end the transaction is either committed on all nodes (except those
> that fail) or on none at all.
>
> Here is a picture of the process:
> http://www.codership.com/wiki/doku.php?id=certification. The certification
> algorithm itself was proposed by Fernando Pedone in his PhD thesis. The idea
> is that global event ordering allows us to make consistent decisions
> without the need for additional communication.
>
> Note that if only one node in the cluster accepts writes, certification will
> always be positive.

So the picture seems to suggest that certification happens on each
server independently. I don't know how you make sure that the result
of the certification is the same on each server (it would be nice to
know that). But anyway, it looks like you need at least one roundtrip
to each node to deliver the writeset and make sure it's delivered. And
I guess a single misbehaving node will freeze all transactions until
that node is excluded from the cluster. Is that correct?

>>> As an example here's an independent comparison of Galera vs.
>>> semi-sync performance:
>>> http://linsenraum.de/erkules/2011/06/momentum-galera.html.
>>
>> This is a nice blog post written in German and posted in 2011. And
>
> You don't seriously expect that something has changed in that department
> since then, do you? ;)
>
>> while Google Translate gave me an idea of what the post was about, it
>> would be nice to see something more recent, and with a better description
>> of the actual testing setup.
>
> Sure thing, but who will bother?

Are you serious with these questions? So you are telling me "cluster
is much better than semi-sync", I'm asking you "give me the proof",
and you answer me "who bothers to have a proof"? And you want me to
treat your claims seriously?

> However here's something from 2012 and in
> English - but no pictures:
> http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-cluster-with-semi-sync-replication-cross-wan/

This is really ridiculous testing with really ridiculous conclusions.
What kind of comparison is that if you are testing a 6-replica Percona
Cluster against a 2-replica semi-sync setup? Disabling log_bin
and innodb_support_xa on Percona Cluster is also very nice -- how will
you recover from server crashes? And where will nodes get the last events
from after a network disconnection? "I ignored quorum arbitration" also
doesn't sound promising, even though I don't know what it is.

> Being a WAN test it may not be directly relevant to your case, but it kinda
> shows that Galera replication is more efficient than semi-sync in WAN, and
> is likely to be also more efficient in LAN. In fact, given that semi-sync
> replicates one transaction at a time, it is hard to be less efficient than
> semi-sync. Only through deliberate sabotage.

Well, sure, as long as your only definition of "efficiency" is
something like 32-threaded sysbench results. But how about
single-threaded sysbench results, i.e. average transaction latency in
single-threaded client mode? And how about another killer case: what
is the maximum number of parallel updates per second that you can make
to a single row?
When you talk about efficiency you need to talk about a wide range of
different use cases.
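
The single-row case raised above has a simple first-principles bound: if each commit must wait for one synchronous replication round trip, and updates to the same row cannot certify concurrently, then serial updates to one row are capped at roughly one per round trip. A back-of-envelope sketch, with round-trip times that are assumed for illustration, not measured:

```python
def max_serial_updates_per_sec(rtt_ms):
    # Each commit to the same row must wait one replication round trip,
    # so updates to a single hot row cannot overlap in time.
    return 1000.0 / rtt_ms

LAN_RTT_MS = 0.5   # assumed LAN round trip
WAN_RTT_MS = 50.0  # assumed WAN round trip

print(max_serial_updates_per_sec(LAN_RTT_MS))  # 2000.0
print(max_serial_updates_per_sec(WAN_RTT_MS))  # 20.0
```

The same bound applies to semi-sync, of course; the point is only that throughput at high thread counts says nothing about this per-row ceiling.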

>>> 2. Node reconnecting to cluster will normally receive only events that it
>>> missed while being disconnected.
>>
>>
>> This seems to contradict the docs. Again from
>> https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ :
>> "After a temporary split, if the 'good' part of the cluster was still
>> reachable and its state was modified, resynchronization occurs".
>
> Yes, but it does not specify the sort of synchronization - whether it is a
> full state snapshot transfer or merely a catch-up with missing transactions.
> Depending on the circumstances, either of those can occur.

It would be nice to see what algorithm is used to choose which kind of
synchronization is necessary.

>>> Yet, Galera nodes can be started simultaneously and then joined
>>> together by setting wsrep_cluster_address from a mysql client connection.
>>> This is not an advertised method, because in that case the state snapshot
>>> transfer can be done only by mysqldump. If you set the address in
>>> advance, rsync or xtrabackup can be used to provision the fresh node.
>>
>>
>> This is of course better because I can start all instances with the
>> same command line arguments. But transferring a snapshot of a very big
>> database using mysqldump, and causing the node that creates the
>> mysqldump to blow up its memory consumption during the process, is
>> still a big problem.
>
> How would you do this with semi-sync? Restore from backup and replay missing
> events? Well, you can do the same with Galera.

I'm sorry, but this is not mentioned anywhere in the docs. So I don't
know what Galera allows me to do in this case.

>>> 4. Every Galera node can perfectly work as either master or slave to
>>> native
>>> MySQL replication. So migration path is quite clear.
>>
>>
>> Nope, not clear yet. So I'll be able to upgrade all my MySQL instances
>> to a Galera-supporting binary while they are replicating using
>> standard MySQL replication. That's good. Now, how is Galera
>> replication turned on after that? What will happen if I just set
>> wsrep_cluster_address on all replicas? What will the replicas do,
>> and what will happen with the standard MySQL replication?
>
>
> Ok, I was clearly too brief there.
>
> 1) You shut down the first slave, upgrade the software, add the required
> configuration, restart it as a single-node cluster, and connect it back to
> the master as a regular slave.
> 2) For the rest of the slaves: shut down the slave, upgrade the software,
> add the required configuration, and join it to the Galera cluster. The
> Galera cluster functions as a single collective slave now, with only Galera
> replication between the nodes.
> Depending on how meticulous you are, you can avoid a full state snapshot if
> you take care to note the offset (in the number of transactions) between
> the moments the first node and this node were shut down. Then you can forge
> the Galera GTID corresponding to this node's position and just replay the
> missing transactions cached by the first node (make sure it is specified in
> wsrep_sst_donor). If the node does not know its Galera GTID, then obviously
> it needs a full SST.
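
The offset bookkeeping described in step 2 amounts to simple arithmetic. The helper below is hypothetical (there is no such Galera API; the function name and arguments are invented for illustration): a node's forged position is the first node's shutdown seqno minus however many transactions the master executed between the two shutdowns.

```python
def forge_gtid(cluster_uuid, first_node_shutdown_seqno, offset):
    """Forge a Galera GTID for a node joining without a full SST.

    offset: number of transactions executed on the master between this
    node's shutdown and the first node's shutdown, i.e. how far behind
    the first node's position this node is in the total order.
    Hypothetical helper, for illustration only."""
    return f"{cluster_uuid}:{first_node_shutdown_seqno - offset}"

# If the first node stopped at seqno 100 and this node stopped 7
# transactions earlier, its forged position is seqno 93; the donor
# then only has to replay seqnos 94..100 from its cache.
print(forge_gtid("4e2b9134-0000-0000-0000-000000000000", 100, 7))
```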

Hm... As Galera is not available for MariaDB 10.0, I assume the Galera
GTID is not the same as MariaDB's GTID. This is confusing, and it's
apparently not documented anywhere...

> 3) when all nodes are converted perform master failover to one of Galera
> nodes like you'd normally do. Now you can stop the remaining slave.
> 4) Convert former master as per 2)
>
> If this looks dense, quick Google search gives:
> http://www.severalnines.com/blog/field-live-migration-mmm-mariadb-galera-cluster
> https://github.com/percona/xtradb-cluster-tutorial/blob/master/instructions/Migrate%20Master%20Slave%20to%20Cluster.rst

This is the best advice I've ever heard from (presumably) a developer
of a big and complicated piece of software: if you need documentation
on how to use it, go google it and you may find some blog posts by
someone who uses it... OK, thanks, now I know how I can find more info
on Galera Cluster.


Pavel

