Re: MariaDB Galera replication
<snip>
There can be a lot of detail ;) I'll start with this:

1) During transaction execution Galera records the unique keys of the rows modified or referenced (foreign keys) by the transaction.
2) At prepare time it takes those keys and the binlog events from the thread IO cache and wraps them into a "writeset".
3) The writeset is synchronously copied to all nodes. This is the only synchronous operation, and it can be done either over TCP or over multicast UDP. All nodes, including the sender, receive writesets in exactly the same order, which defines the sequence number part of the GTID. The writeset is placed in the receive queue for further processing.
4) The writeset is picked from the queue and (in seqno order) passed through the certification algorithm, which determines whether the writeset can be applied or not, and also which writesets it can be applied in parallel with.
5) If the certification verdict is positive, the master commits the transaction and sends OK to the client; a slave applies and commits the binlog events from the writeset.
6) If the certification verdict is negative, the master rolls back the transaction and sends a deadlock error to the client; a slave just discards the writeset.

In the end the transaction is either committed on all nodes (except those that fail) or on none at all.
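To tie the steps together, here is a rough Python-flavoured sketch of that commit path. Every name in it is invented for illustration - it is not the Galera API - and it hides the real group communication and certification machinery (a toy certification function is sketched a bit further down):

    # Pseudocode-level sketch of steps 1-6; all names are made up.
    def replicate_and_commit(trx, group, node):
        # 1-2) at prepare time: wrap the recorded keys and binlog events
        #      into a writeset
        writeset = {"keys": trx.keys, "events": trx.binlog_events}

        # 3) the only synchronous step: total-order broadcast; the position
        #    in that order becomes the seqno part of the GTID
        writeset["seqno"] = group.broadcast(writeset)

        # 4) certification runs locally, in seqno order, on every node
        if node.certify(writeset):
            # 5) master commits and replies OK; slaves apply the events
            node.apply_and_commit(writeset)
            return "OK"
        # 6) master rolls back and reports a deadlock; slaves drop the writeset
        node.rollback(trx)
        return "ERROR: deadlock"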
Here is a picture of the process: http://www.codership.com/wiki/doku.php?id=certification. The certification algorithm itself was proposed by Fernando Pedone in his PhD thesis. The idea is that global event ordering allows us to make consistent decisions without the need for additional communication.

Note that if only one node in the cluster accepts writes, certification will always be positive.
So the picture seems to suggest that certification happens on each server independently. I don't know how you make sure that the result of the certification is the same on each server (it would be nice to know that).
The certification test is deterministic provided the writesets are processed in the same order, and the group communication transport makes sure that the writesets are globally totally ordered. That is basically the main Galera difference: group communication instead of unrelated TCP links.
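To illustrate why every node reaches the same verdict, here is a toy Python model of interval-based certification. The field names and structure are made up for the example and are much simpler than the real implementation; the point is only that the verdict is a pure function of the writeset and the totally ordered history, so no extra communication is needed:

    # Toy model: certification is deterministic given the global total order.
    from dataclasses import dataclass

    @dataclass
    class Writeset:
        seqno: int          # position in the global total order
        last_seen: int      # last seqno applied when the transaction executed
        keys: frozenset     # unique keys it modified or referenced

    def certify(ws, cert_index):
        """Fail if any of ws.keys was written by a writeset committed
        after ws.last_seen (i.e. one the transaction could not have seen)."""
        for key in ws.keys:
            if cert_index.get(key, 0) > ws.last_seen:
                return False                 # conflict -> deadlock error
        for key in ws.keys:
            cert_index[key] = ws.seqno       # remember for later writesets
        return True                          # OK to commit / apply

    # Every node runs the same loop over the same ordered stream of writesets,
    # so every node computes exactly the same verdicts.
    stream = [
        Writeset(seqno=1, last_seen=0, keys=frozenset({"t1:pk=1"})),
        Writeset(seqno=2, last_seen=0, keys=frozenset({"t1:pk=1"})),  # conflict
        Writeset(seqno=3, last_seen=2, keys=frozenset({"t1:pk=2"})),
    ]
    cert_index = {}
    print([certify(ws, cert_index) for ws in stream])   # [True, False, True]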
But anyway, it looks like you need at least one round trip to each node to deliver the writeset and make sure that it's delivered. And I guess a single misbehaving node will freeze all transactions until that node is excluded from the cluster. Is that correct?
Yes, you're correct. It's kinda clusterish.
As an example, here's an independent comparison of Galera vs. semi-sync performance:
http://linsenraum.de/erkules/2011/06/momentum-galera.html.
This is a nice blog post written in German and posted in 2011. And

You don't seriously expect that something has changed in that department since then, do you? ;)

while Google Translate gave me an idea of what the post was about, it would be nice to see something more recent and with a better description of the actual testing setup.
Sure thing, but who will bother?
Are you serious with these questions? So you are telling me "cluster
is much better than semi-sync", I'm asking you "give me the proof",
and you answer me "who bothers to have a proof"? And you want me to
treat your claims seriously?
That really was a rhetorical question, but if you insist...
One. Two years ago one dude decided to compare Galera and semi-sync as best as he could. And it was kinda a clear case. You know how semi-sync works, and you know it's hard to do worse. Since then only Jay cared to do it over WAN, and it was what everybody expected. Besides that, literally nobody bothered about semi-sync. Even Kristian told you that he doesn't.
Two. We are kinda busy developing and improving our software. And as long as we believe that there is enough evidence from the field that our software gets better, it would be irresponsible of us to spend time and money on churning out quarterly benchmark results, wouldn't it? Especially given that most of those are hardly applicable in real life and anyone can dismiss them as skewed. Or expired.
Three. I'm not trying to sell you anything. Had it been about asynchronous replication, I would not have spoken at all. However, sincerely believing that Galera covers all semi-sync use cases, I asked why you don't use it. I wanted to know why it does not work for you, why you are fixing semi-sync instead.

But now we ended up here, on a public mailing list. And that kinda makes me obliged to expose your misconceptions and accept just criticism.
However here's something from 2012 and in English - but no pictures:
http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-cluster-with-semi-sync-replication-cross-wan/
This is really ridiculous testing with really ridiculous conclusions.
That's a debatable statement ;) I think many would disagree.
What kind of comparison is that if you are testing a 6-replica Percona Cluster against a 2-replica setup with semi-sync?

Well, Jay was comparing Percona Cluster, with one master replicating to (eventually) 5 slaves (which is presumably more work), against MySQL semi-sync with one master replicating to one slave. And he sees that Percona Cluster does no worse than semi-sync with one client thread and WAY better with several threads. It kinda answers many of your questions about performance.
Disabling log_bin
and innodb_support_xa on Percona Cluster is also very nice -- how will
you recover from server crashes?
And what are the other nodes for? Don't you yourself want to employ semi-sync to avoid extra flushes? And how would you recover from crashes then? Here's the quote from your original post which prompted me to ask you about Galera:
"Semi-sync replication for us is a DBA tool that helps to achieve
durability of transactions in the world where MySQL doesn't do any
flushes to disk. As you may guess by removing disk flushes we can
achieve a very high transaction throughput. Plus if we accept the
reality that disks can fail and repairing information from it is
time-consuming and expensive (if at all possible), with such reality
you can realize that flush or no flush there's no durability if disk
fails, and thus disk flushes don't make much sense."
This is exactly what we stand for with Galera: durability through
redundancy. Or am I missing something?
And where will nodes get the last events from after a network disconnection?
From the cluster. That's what it is there for.
"I ignored quorum arbitration" also
doesn't sound promising even though I don't know what it is.
This really isn't a big deal. He just had two datacenters with equal
number of nodes in them. Had network been broken between them there'd be
a "split-brain". It is relevant to multi-master use case only. And
practically irrelevant to performance benchmarking that he did.
Being a WAN test it may not be directly relevant to your case, but it kinda shows that Galera replication is more efficient than semi-sync over a WAN, and is likely to be more efficient over a LAN as well. In fact, given that semi-sync replicates one transaction at a time, it is hard to be less efficient than semi-sync. Only through deliberate sabotage.
Well, sure, as long as your only definition of "efficiency" is
something like 32-threaded sysbench results. But how about
single-threaded sysbench results, i.e. average transaction latency in
single-threaded client mode?
That was in the first table:
semi-sync: 102 ms
Percona cluster: 108 ms
Ok, this was not sysbench, it was just manual inserts.
And how about another killer case: what
is the maximum number of parallel updates per second that you can make
to a single row?
But of course, it is now well known, 1/RTT.
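Back of the envelope, assuming updates to that one row have to be strictly serialized over the replication round trip (the numbers are just example RTTs):

    # Back-of-the-envelope: updates to a single row are serialized over the RTT.
    for rtt_ms in (1, 100):                    # e.g. LAN vs. WAN round-trip time
        print(rtt_ms, "ms RTT ->", int(1000 / rtt_ms), "updates/sec max")
    # 1 ms RTT -> 1000 updates/sec max
    # 100 ms RTT -> 10 updates/sec max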
When you talk about efficiency you need to talk about a wide range of
different use cases.
2. A node reconnecting to the cluster will normally receive only the events that it missed while being disconnected.

This seems to contradict the docs. Again from
https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ :
"After a temporary split, if the 'good' part of the cluster was still
reachable and its state was modified, resynchronization occurs".
Yes, but it does not specify the sort of synchronization - whether it is a full state snapshot transfer or merely a catch-up with the missing transactions. But depending on the circumstances, either of those can occur.
It would be nice to see what algorithm is used to choose which kind of synchronization is necessary.

It is rather simple: if possible (i.e. the required transactions are still present in the donor's cache), the missing transactions are replayed; if not, a full snapshot is copied. But yes, this area is not totally without gotchas yet...
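Roughly, the choice could be sketched like this (the names are invented for illustration; the real check lives in the donor selection and writeset cache code):

    # Illustrative sketch of how a joiner's recovery method is chosen.
    def choose_state_transfer(joiner_seqno, donor_cache_first, donor_cache_last):
        """IST if the donor still caches every writeset the joiner is missing,
        otherwise fall back to a full state snapshot transfer (SST)."""
        if donor_cache_first <= joiner_seqno + 1 <= donor_cache_last + 1:
            return "IST"   # replay only the missing writesets
        return "SST"       # copy a full snapshot (rsync/xtrabackup/mysqldump)

    print(choose_state_transfer(1500, donor_cache_first=1000, donor_cache_last=2000))  # IST
    print(choose_state_transfer(10,   donor_cache_first=1000, donor_cache_last=2000))  # SST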
Yet, Galera nodes can be started simultaneously and then joined together by setting wsrep_cluster_address from a mysql client connection. This is not an advertised method, because in that case the state snapshot transfer can be done only by mysqldump. If you set the address in advance, rsync or xtrabackup can be used to provision the fresh node.
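For concreteness, a minimal sketch of joining an already-running node that way (host names and credentials are made up):

    # Illustrative only: join a running node to the cluster from a client
    # connection; the SET GLOBAL triggers the state transfer.
    import pymysql

    conn = pymysql.connect(host="new-node.example.com", user="root", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            "SET GLOBAL wsrep_cluster_address = "
            "'gcomm://node1.example.com,node2.example.com'"
        )
    conn.close()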
This is of course better, because I can start all instances with the same command line arguments. But transferring a snapshot of a very big database using mysqldump, and causing the node that creates the dump to blow up its memory consumption in the process, is still a big problem.

How would you do this with semi-sync? Restore from backup and replay the missing events? Well, you can do the same with Galera.
I'm sorry, but this is not mentioned anywhere in the docs. So I don't know what Galera allows me to do in this case.
It is now plain to see our complete failure with documentation. And I
guess that answers my initial question of why you're not using Galera.
4. Every Galera node can perfectly well work as either master or slave to native MySQL replication. So the migration path is quite clear.

Nope, not clear yet. So I'll be able to upgrade all my MySQL instances to a Galera-supporting binary while they are replicating using standard MySQL replication. That's good. Now, how is Galera replication turned on after that? What will happen if I just set wsrep_cluster_address on all replicas? What will the replicas do, and what will happen with the standard MySQL replication?
Ok, I was clearly too brief there.

1) You shut down the first slave, upgrade the software, add the required configuration, restart it as a single-node cluster, and connect it back to the master as a regular slave.

2) For the rest of the slaves: shut down the slave, upgrade the software, add the required configuration, and join it to the Galera cluster. The Galera cluster now functions as a single collective slave, with only Galera replication between the nodes.

Depending on how meticulous you are, you can avoid a full state snapshot if you take care to note the offset (in number of transactions) between the moments the first node and this node were shut down. Then you can forge the Galera GTID corresponding to this node's position and just replay the missing transactions cached by the first node (make sure it is specified in wsrep_sst_donor). If the node does not know its Galera GTID, then obviously it needs a full SST.
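For illustration, the Galera GTID here is simply the cluster state UUID plus a seqno; something along these lines (the UUID and numbers are made up, and how you derive the seqno depends on the offset you observed at shutdown):

    # Illustrative only: forging a start position for the joining node.
    cluster_uuid = "ab0f2f4e-0000-0000-0000-00000000beef"  # made-up state UUID
    seqno = 123456                  # worked out from the offset noted at shutdown
    start_position = f"{cluster_uuid}:{seqno}"
    print(start_position)           # e.g. given to the node as its wsrep start position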
Hm... As Galera is not available for MariaDB 10.0 I assume Galera GTID
is not the same as MariaDB's GTID. This is confusing, and it's
apparently not documented anywhere...
Yes, at the moment that is the case. We develop our patch against Oracle's sources and then it gets ported to PXC and MariaDB Cluster. Currently MariaDB Cluster is a bit behind, and MariaDB GTID support may be challenging. However, this will be relevant only if you decide to heavily mix Galera and native replication (as in having two Galera clusters replicate to each other asynchronously). For migration it is probably of little importance.
3) When all nodes are converted, perform a master failover to one of the Galera nodes like you'd normally do. Now you can stop the remaining slave.

4) Convert the former master as per 2).

If this looks dense, a quick Google search gives:
http://www.severalnines.com/blog/field-live-migration-mmm-mariadb-galera-cluster
https://github.com/percona/xtradb-cluster-tutorial/blob/master/instructions/Migrate%20Master%20Slave%20to%20Cluster.rst
This is the best advice I've ever heard from a (presumed) developer of a big and complicated piece of software: if you need documentation on how to use it, go google it and you may find some blog posts by someone who uses it... OK, thanks, now I know how I can find more info on Galera Cluster.
Sarcasm is good. But if you look at it realistically, these were real-world guys solving their real-world problems. How can a developer of not so big, but nevertheless complicated *C++* software provide you with exhaustive instructions on how to do *DBA* stuff, which, given the admitted complexity of the problem and the diversity of requirements and approaches, would take volumes? Apparently these guys didn't find it that hard to understand how Galera applies to their problem. This is not to say that our documentation doesn't suck, but how are these blog posts worse than something I would have written? And why shouldn't I refer to third-party knowledge?

Anyway, as I already said above, the point is taken, even though it is beside the technical merits of Galera.
Regards,
Alex
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011