maria-developers team mailing list archive

Re: MariaDB Galera replication

To: Pavel Ivanov <pivanof@xxxxxxxxxx>
From: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
Date: Sat, 16 Nov 2013 03:55:46 +0200
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAAG=WUsuH_BCoqgU3UrndrQ_02n1PyaKPMrgzD8augTw=2X86Q@mail.gmail.com>
Organization: Codership Oy
User-agent: Roundcube Webmail/0.9.2

On 2013-11-15 23:59, Pavel Ivanov wrote:

I'm starting a new thread as this already doesn't have anything to
do with the original topic.


Fair enough.

On Fri, Nov 15, 2013 at 10:46 AM, Alex Yurchenko
<alexey.yurchenko@xxxxxxxxxxxxx> wrote:

Please pardon this arrogant interruption of your discussion and shameless
self-promotion, but I just could not help noticing that Galera replication
was designed specifically with these goals in mind. And it does seem to
achieve them better than semi-sync plugin. Have you considered Galera?
What makes you prefer semi-sync over Galera, if I may ask?

To be honest I never looked at how Galera works before. I've looked at
it now and I don't see how it can fit with us. The major disadvantages
I immediately see:
1. Synchronous replication. That means client must wait while
transaction is applied on all nodes which is unacceptably big latency
of each transaction. And what if there's a network blip and some node
becomes inaccessible? All writes will just freeze? I see the statement
that "failed nodes automatically excluded from the cluster", but to do
that cluster must wait for some timeout in case it's indeed a network
blip and node will "quickly" reconnect. And every client must wait for
cluster to decide what happened with that one node.
2. Let's say node fell out of the cluster for 5 minutes and then
reconnected. I guess it will be treated as "new node", it will
generate state transfer and the node will start downloading the whole
database? And while it's trying to download say 500GB of data files
all other nodes (or maybe just donor?) won't be able to change those
files locally and thus will blow up its memory consumption. That means
they could quickly run out of memory and "new node" won't be able to
finish its "initialization"...
3. It looks like there's strong asymmetry in starting cluster nodes --
the first one should be started with empty wsrep_cluster_address and
all others should be started with the address of the first node. So I
can't start all nodes uniformly and then issue some commands to
connect them to each other. That's bad.
4. What's the transition path? How do I upgrade MySQL/MariaDB
replicating using usual replication to Galera? It looks like there's
no such path and the solution is to stop the world using regular
replication and restart it using Galera. Sorry I can't do that with
our production systems.

I believe these problems are severe enough for us, so that we can't
work with Galera.

Pavel, you seem to be terribly mistaken on almost all accounts:

1. *Replication* (i.e. data buffer copying) is indeed synchronous. But
nobody said that commit is. What Galera does is very similar to semi-sync,
except that it does it technically better. I would not dare to suggest
Galera replication if I didn't believe it to be superior to semi-sync in
every respect.


Well, apparently we have a different understanding of what the term
"synchronous replication" means. This term is all over the Galera doc,
but I didn't find a detailed description of how Galera
replication actually works. So I assumed that my understanding of the term
(which actually seems to be in line with wiki's definitions
http://en.wikipedia.org/wiki/Replication_(computing) ) is what was
implied there. So I hope you'll be able to describe in detail how
Galera replication works.


There can be much detail ;) I'll start with this:

1) During transaction execution Galera records unique keys of the rows
modified or referenced (foreign keys) by the transaction.

2) At prepare time it takes the keys and binlog events from the thread
IO cache and wraps them into a "writeset".

3) The writeset is synchronously copied to all nodes. This is the only
synchronous operation and can be done either over TCP or multicast UDP.
All nodes, including the sender, receive writesets in exactly the same
order, which defines the sequence number part of the GTID. The writeset
is placed in the receive queue for further processing.

4) The writeset is picked from the queue and (in seqno order) is passed
through the certification algorithm, which determines whether the writeset
can be applied or not and also which writesets it can be applied in
parallel with.

5) If the certification verdict is positive, master commits the transaction
and sends OK to client, slave applies and commits the binlog events from
the writeset.

6) If the certification verdict is negative, master rolls back the
transaction and sends deadlock error to client, slave just discards the
writeset.

In the end the transaction is either committed on all nodes (except for
those that fail) or on none at all.

Here is a picture of the process:
http://www.codership.com/wiki/doku.php?id=certification. The
certification algorithm itself was proposed by Fernando Pedone in his
PhD thesis. The idea is that global event ordering allows us to make
consistent decisions without the need for additional communication.

Note that if only one node in the cluster accepts writes, certification
will always be positive.
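
The steps above can be illustrated with a toy model. This is only a
sketch under simplifying assumptions, not Galera's actual code: the
Writeset and Node classes, the last_seen field and the "first committer
wins" rule are all hypothetical stand-ins, every key is treated as
written, and binlog events are left out. It shows why nodes that receive
the same writesets in the same order reach the same verdict without
talking to each other again.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Writeset:
    source: str        # node that executed the transaction
    keys: frozenset    # unique keys of rows modified or referenced
    last_seen: int     # highest seqno the source had applied at prepare time


@dataclass
class Node:
    name: str
    last_writer: dict = field(default_factory=dict)  # key -> seqno of last certified writeset
    committed: list = field(default_factory=list)    # seqnos committed on this node

    def certify(self, seqno, ws):
        """Positive verdict unless a key was changed by a writeset the source had not seen."""
        ok = all(self.last_writer.get(key, 0) <= ws.last_seen for key in ws.keys)
        if ok:
            for key in ws.keys:
                self.last_writer[key] = seqno
            self.committed.append(seqno)  # master commits, slave applies and commits
        return ok                         # on False: master rolls back, slave discards


def replicate(nodes, ordered_writesets):
    """Deliver the same writesets in the same order to every node (step 3)."""
    verdicts = []
    for seqno, ws in enumerate(ordered_writesets, start=1):
        results = {node.certify(seqno, ws) for node in nodes}
        assert len(results) == 1  # same order + same rule => same verdict everywhere
        verdicts.append(results.pop())
    return verdicts


a, b = Node("a"), Node("b")
row1, row2 = ("t", 1), ("t", 2)
verdicts = replicate([a, b], [
    Writeset("a", frozenset({row1}), last_seen=0),
    Writeset("b", frozenset({row1}), last_seen=0),  # conflicts with seqno 1
    Writeset("b", frozenset({row2}), last_seen=0),  # touches a different row
])
print(verdicts, a.committed, b.committed)
# prints: [True, False, True] [1, 3] [1, 3]
```

In this model the second writeset fails certification on both nodes
because node "b" prepared it before seeing seqno 1, which changed the
same row. A single writer that has always seen its own previous
writesets never hits that case.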

As an example here's an independent comparison of Galera vs.
semi-sync performance:
http://linsenraum.de/erkules/2011/06/momentum-galera.html.


This is a nice blog post written in German and posted in 2011. And

You don't seriously expect that something has changed in that department
since then, do you? ;)

while Google Translate gave me an idea what the post was about it would be
nice to see something more recent and with a better description of what
the actual testing setup was.

Sure thing, but who will bother? However here's something from 2012 and
in English - but no pictures:
http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-cluster-with-semi-sync-replication-cross-wan/

Being a WAN test it may not be directly relevant to your case, but it
kinda shows that Galera replication is more efficient than semi-sync in
WAN, and is likely to be also more efficient in LAN. In fact, given that
semi-sync replicates one transaction at a time, it is hard to be less
efficient than semi-sync. Only through deliberate sabotage.
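
A back-of-the-envelope model of the "one transaction at a time" remark.
It assumes each acknowledged transaction costs one network round trip
and that nothing else is the bottleneck. The function name, the 100 ms
round trip and the figure of 32 overlapping transactions are made-up
illustration values, not measurements of semi-sync or Galera.

```python
def max_tps(rtt_ms, in_flight=1):
    """Upper bound on acknowledged transactions per second when every
    transaction needs one round trip and `in_flight` of them may overlap."""
    if rtt_ms <= 0 or in_flight < 1:
        raise ValueError("rtt_ms must be positive and in_flight at least 1")
    return 1000 * in_flight / rtt_ms


serial = max_tps(100)                    # one transaction at a time
overlapped = max_tps(100, in_flight=32)  # 32 transactions replicated concurrently
print(serial, overlapped)
# prints: 10.0 320.0
```

Under these assumptions the ceiling scales linearly with the number of
transactions that may be in flight at once, which is why strictly serial
replication is hard to undercut on a high-latency link.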

In fact, the majority of Galera users migrated from the regular
*asynchronous* MySQL replication, which I think is a testimony to
Galera performance.


I don't mean to troll, but this can also mean that everyone who
migrated didn't care much about performance and Galera's performance
was within sane boundaries...

BTW, just found here
https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ :
"by design performance of the cluster cannot be higher than
performance of the slowest node; however, even if you have only one
node, its performance can be considerably lower comparing to running
the same server in a standalone mode". That contradicts your words.

Replication has its overhead, and it is not inconceivable to create a
load where that overhead will dominate. Still I doubt that it will be
higher than that of a standalone server WITH BINLOG ENABLED. At least
with real life loads.

2. Node reconnecting to cluster will normally receive only events that it
missed while being disconnected.


This seems to contradict the docs. Again from
https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ :
"After a temporary split, if the 'good' part of the cluster was still
reachable and its state was modified, resynchronization occurs".

Yes, but it does not specify the sort of synchronization - whether it is
a full state snapshot transfer or merely a catch up with missing
transactions. But, depending on the circumstances any of those can
occur.
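
One way to picture "depending on the circumstances" is a decision based
on what the rejoining node knows about its own position and what the
donor still has cached. The sketch below is an assumption-laden
illustration: resync_method, its parameters and the returned labels are
hypothetical names, and real donor selection involves more than this.

```python
from typing import Optional


def resync_method(joiner_seqno: Optional[int], cache_first: int, cache_last: int) -> str:
    """Pick how a rejoining node catches up.

    joiner_seqno: last seqno the node applied, or None if it does not know
    cache_first, cache_last: range of seqnos the donor still has cached
    """
    if joiner_seqno is None:
        return "snapshot"      # position unknown: full state snapshot transfer
    if joiner_seqno >= cache_last:
        return "none"          # nothing was missed
    if joiner_seqno + 1 >= cache_first:
        return "catch-up"      # every missed writeset is still cached
    return "snapshot"          # gap is older than the donor's cache


print(resync_method(150, 100, 200))   # prints: catch-up
print(resync_method(50, 100, 200))    # prints: snapshot
print(resync_method(None, 100, 200))  # prints: snapshot
```

A node that was away for a few minutes would typically land in the
"catch-up" branch, provided the donor's cache reaches back far enough.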

3. You are partially right about it, but is it much different from
regular MySQL replication where you first need to set up master and then
connect slaves (even if you have physically launched the servers at the same
time)?


Operation of setting up master and then connecting slaves consists of
mostly only executing CHANGE MASTER TO and then START SLAVE on all
slaves after all MySQL instances (including master) were started with
the same set of command line flags. This is fundamentally different
from starting instances with different arguments, especially when
these arguments should be different depending on whether the replica
is starting first or there's already some other replica running.

It looks like either way you have to treat master and slaves
differently. However with modern Galera this difference simply boils down to:

- you start the first node of a cluster with
  service mysql start --wsrep-new-cluster

- you start all other nodes with just service mysql start.
(wsrep_cluster_address can be the same on all nodes)
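
The asymmetry described above is small enough to express in a few lines.
This is a sketch only: start_commands and the node names are
hypothetical, and the two command strings are simply the ones quoted in
this message.

```python
def start_commands(nodes):
    """Map each node name to its start command.

    Only the first node in the list bootstraps the cluster; every other
    node uses the plain start command.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    base = "service mysql start"
    return {
        name: base + (" --wsrep-new-cluster" if index == 0 else "")
        for index, name in enumerate(nodes)
    }


commands = start_commands(["db1", "db2", "db3"])
for name, command in commands.items():
    print(name, "->", command)
```

Which node goes first is a choice made at start time, not something
baked into the configuration, since wsrep_cluster_address can be
identical everywhere.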

Yet, Galera nodes can be started simultaneously and then joined
together by setting wsrep_cluster_address from mysql client connection. This
is not an advertised method, because in that case state snapshot transfer can
be done only by mysqldump. If you set the address in advance, rsync or
xtrabackup can be used to provision the fresh node.


This is of course better because I can start all instances with the
same command line arguments. But transferring snapshot of a very big
database using mysqldump, and causing the node that creates mysqldump
to blow up memory consumption during the process, that is still a big
problem.

How would you do this with semi-sync? Restore from backup and replay
missing events? Well, you can do the same with Galera.

4. Every Galera node can perfectly work as either master or slave to native
MySQL replication. So migration path is quite clear.


Nope, not clear yet. So I'll be able to upgrade all my MySQL instances
to a Galera-supporting binary while they are replicating using
standard MySQL replication. That's good. Now, how is Galera
replication turned on after that? What will happen if I just set
wsrep_cluster_address on all replicas? What will replicas do,
and what will happen with the standard MySQL replication?


Ok, I was clearly too brief there.

1) you shut down the first slave, upgrade software, add required
configuration, restart it as a single node cluster, connect it back to
master as a regular slave.

2) for the rest of the slaves: shut down the slave, upgrade software,
add required configuration, join it to Galera cluster. Galera cluster
functions as a single collective slave now. Only Galera replication
between the nodes. Depending on how meticulous you are, you can avoid
full state snapshot if you take care to notice the offset (in the number
of transactions) between the moments the first and this nodes were shut
down. Then you can forge the Galera GTID corresponding to this node
position and just replay missing transactions cached by the first node
(make sure it is specified in wsrep_sst_donor). If the node does not
know its Galera GTID, then, obviously it needs full SST.

3) when all nodes are converted perform master failover to one of Galera
nodes like you'd normally do. Now you can stop the remaining slave.

4) Convert former master as per 2)
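
The GTID "forging" in step 2 is plain arithmetic once the offset is
known. The sketch below assumes the Galera GTID is written as
"<cluster uuid>:<seqno>" and that every master transaction replayed
through the first node consumed exactly one seqno. forged_gtid and its
parameter names are hypothetical, and the UUID in the example is a
placeholder.

```python
import uuid


def forged_gtid(cluster_uuid: str, first_node_seqno: int, offset: int) -> str:
    """Build the GTID for a slave stopped `offset` transactions after the first node.

    first_node_seqno: seqno the first node had when it started the cluster
    offset: master transactions this slave applied beyond that point
    """
    uuid.UUID(cluster_uuid)  # raises ValueError if the UUID is malformed
    if first_node_seqno < 0 or offset < 0:
        raise ValueError("seqno and offset must be non-negative")
    return f"{cluster_uuid}:{first_node_seqno + offset}"


gtid = forged_gtid("00000000-0000-0000-0000-000000000001", 0, 42)
print(gtid)
# prints: 00000000-0000-0000-0000-000000000001:42
```

If the offset is miscounted the node would replay from the wrong
position, so falling back to a full SST is the safe choice whenever the
numbers are in doubt.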

If this looks dense, quick Google search gives:
http://www.severalnines.com/blog/field-live-migration-mmm-mariadb-galera-cluster
https://github.com/percona/xtradb-cluster-tutorial/blob/master/instructions/Migrate%20Master%20Slave%20to%20Cluster.rst

It is very sad that you happen to have such gross misconceptions about
Galera. If those were true, how would MariaDB Galera Cluster get paying
customers?


Care to share some numbers? Like what's the rough amount of those
paying customers? What size is the biggest installation -- number of
clusters, replicas, highest QPS load?
I'm not asking to share any confidential information, but the rough
ballpark of the numbers would be helpful.

Unfortunately I'm not at liberty to discuss paying customers, especially
given that many of them are customers of our partners, and I myself am
not privy to the details. The point of that remark was that we are
making a living, and it would be very hard to make a living on something
that is no better than MySQL semi-sync, especially given the quality of
our marketing materials ;)

Some public material is available at our site:
http://www.codership.com/user-stories. However it mostly contains no
hard numbers.

Maybe my reply will convince you to have a second look at it.
(In addition to the above Galera is fully multi-master, does parallel
applying and works great in WAN)


I hope your explanation of how Galera replication works will help me
understand how great it works over WAN and how you could make full
multi-master work without fully synchronous replication in my
understanding of that term.


Pavel


--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
