maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication

To: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
From: Henrik Ingo <henrik.ingo@xxxxxxxxxxxxx>
Date: Sat, 20 Mar 2010 13:52:47 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <c86d75d33ea6eda09a7f697afa5af504@localhost>
Reply-to: henrik.ingo@xxxxxxxxxxxxx
Sender: henrik.ingo@xxxxxxxxx

On Wed, Mar 17, 2010 at 9:01 PM, Alex Yurchenko
<alexey.yurchenko@xxxxxxxxxxxxx> wrote:
> The problem is that you cannot really design and program by use cases,
> unorthodox as it may sound. You cannot throw an arbitrary bunch of use
> cases as input and get code as output (that is in a finite time and of
> finite quality). Whether you like it or not, you always program some model.

Uh, I'm not sure I can accept this proposition. At least it seems to
contradict MariaDB's vision of being a practical, user- and
customer-driven database.

As I see it, for real world applications, you should always start with
use cases. But it is ok if you want to come back to me and say that a
subset of use cases should be discarded because they are too difficult
to service, or even contradict each other. But just saying that you'd
like to implement an abstract model without connection to any use
cases sounds dangerous to me.

I'm also a fan of abstract thinking though. Sometimes you can get
great innovations from starting with a nice abstract model, and then
ask yourself which real world problems it would (and would not) solve.
Either way, you end up anchoring yourself in real world use
cases.

> It is by definition that a program is a description of some model. If you
> have not settled on a model, you're in trouble - and that's where mysql
> replication is. This is a direct consequence of trying to satisfy a bunch
> of use cases without first putting them in a perspective of some general
> abstract model.

Yes. It is ok to say that use cases alone, without some "umbrella"
such as an abstract model, will just lead to chaos.

> So now we have a proposed model based on Redundancy Sets, linearly ordered
> global transaction IDs and ordered commits. We pretty much understand how
> it will work, what sort of redundancy it will provide and, as you agreed,
> is easy to use for recovery and node joining. It satisfies a whole bunch of
> use cases, even those where ordering of commits is not strictly required.
> Perhaps we won't be able to have some optimizations where we could have had
> them without ordering of commits, but the benefit of such optimizations is
> highly questionable IMO. MySQL/Galera is a practical implementation of such
> model, may be not exactly what we want to achieve here, but it gives a good
> estimate of performance and performance is good.

Back on track: So the API should of course implement something which
has as broad applicability as possible. This is the whole point of
questioning you, since now you have just suggested a model which
happens to nicely satisfy Galera's needs :-)

But another real-world argument you can make is that we don't need
parallel replication for speed, because at least Galera does well
without it. That should then be benchmarked by someone. The real-world
requirement here is after all "speed", not "parallel replication".

> Now this model may not fit, for instance, NDB-like use case. What options
> do we have here?
>
> 1) Extend somehow the proposed model to satisfy NDB use case. I don't see
> it likely. Because, as you agreed, NDB is not really about redundancy, it
> is about performance. Redundancy is quite specific there. And it is not by
> chance that it is hard to migrate applications to use it.
<cut>

Actually, I don't think the issues with migration/performance have
anything at all to do with how it does replication. (They have to do
with the partitioning/sharding and with limitations of the MySQL
storage engine interface.)

But we should distinguish 2 things here: How NDB does its own
cluster-internal (node-to-node) replication can for our purposes be
considered an engine-internal issue. Otoh MySQL Cluster also uses the
standard MySQL replication and binlog. From there we can derive some
interesting behavior that we should certainly support in the
replication API. I.e. hypothetically MySQL Cluster could use our
replication API for geographical replication, as it uses MySQL
replication today, but there could also be some other engine with
these same requirements.

The requirements I can think of are:

1) As Kristian explained, transactions are often committed on only one
pair or a few pairs of nodes, but not all nodes (partitions) in the
cluster. The only cluster-global (or database global) sync point is
the epoch, which will collect several transactions packed between
cluster-wide heartbeats. To restore to a consistent cluster wide
state, you must choose the border between 2 epochs, not just any
transaction.

-> A consequence for the mysql binlog and replication is that the
whole epoch is today considered one large transaction. I don't know if
this has any consequence for our discussion, other than the
"transactions" (epochs) being large. A nice feature here could be
support for "groups of transactions" (not to be confused with group
commit) or sub-transactions, whichever way you prefer to look at it.
This way an engine like NDB could send information about both the
epoch and each individual transaction inside the epoch to the
redundancy services. (The redundancy services then may or may not use
that info, but the API could support it.)
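
To make the epoch idea in (1) a bit more concrete, here is a rough
Python sketch. All names in it (Transaction, Epoch, restore_boundary)
are made up for illustration and do not exist in NDB or MariaDB. It
only shows an engine reporting both the epoch and the individual
transactions inside it, and a restore point being chosen at the border
between 2 epochs rather than at an arbitrary transaction.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transaction:
    txn_id: str  # hypothetical engine-local transaction identifier


@dataclass
class Epoch:
    number: int  # hypothetical cluster-wide epoch counter
    transactions: List[Transaction] = field(default_factory=list)


def restore_boundary(epochs: List[Epoch], stop_before: str) -> Optional[int]:
    """Return the number of the last epoch that can be applied in full
    without applying transaction `stop_before`.

    Returns None if `stop_before` is in the very first epoch, i.e. no
    complete epoch precedes it. If `stop_before` is not found at all,
    every epoch can be applied and the last epoch number is returned.
    """
    last_complete = None
    for epoch in epochs:
        if any(t.txn_id == stop_before for t in epoch.transactions):
            # We may not stop in the middle of an epoch, so fall back
            # to the previous epoch border.
            return last_complete
        last_complete = epoch.number
    return last_complete
```

For example, with epochs 1 = {a, b} and 2 = {c, d}, asking to stop
before d gives epoch 1 as the restore point, even though c committed
before d.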

2) Sending of transactions to mysql binlog is asynchronous, totally
decoupled from the actual commit that happens in the datanodes. The
reason is that a central binlog would otherwise become a bottleneck in
an otherwise distributed cluster.

-> This is ok also in our current discussion. If the engine doesn't
want to include the replication api in a commit, it just doesn't do so
and there's nothing we can or need to do about it. For instance in the
case of NDB it is NDB who gives you adequate guarantees for
redundancy, the use of mysql binlog is for other reasons.
(Asynchronous geographical replication, and potentially playback and
point-in-time restoring of transactions.)

3) Transactions arrive at the mysql binlog in a somewhat random order,
and it is impossible to know in which order they actually committed.
Due to (2) NDB does not want to sync with a central provider of global
transaction IDs either.

-> When transactions arrive at the replication API, the NDB side may
just act as if they are being committed, even if they have already
been committed in the engine. The replication API would then happily
assign global transaction IDs to the transactions. As in (2), this
makes redundancy services behind this API unusable for database
recovery or node recovery; the engine must guarantee that
functionality (which engines do today anyway, in particular NDB).

-> Transactions "committed" to the replication api become linearly
ordered, even if this order does not 100% correspond to the real order
of how the engine committed them originally. However, I don't see a
problem with this at this point.

-> Assuming that an asynchronous slave would benefit from parallel
replication, it would be advantageous to be able to commit
transactions "out of order". For instance if we introduce the concept
of transaction groups (or sub-transactions), a slave could decide to
commit transactions in random order inside a group, but would have to
sync at the boundary of a transaction group. (This requirement may in
fact worsen performance, since in every epoch you would still have to
wait for the longest running transaction.)
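
The ordering point in (3) can be sketched like this, again in Python
and again with purely hypothetical names (ReplicationLog and its
commit method are not an existing or proposed API). The only thing it
illustrates is that global transaction IDs get assigned in the order
the transactions arrive at the replication API, whatever order the
engine really committed them in.

```python
import itertools
from typing import List, Tuple


class ReplicationLog:
    """Hypothetical stand-in for the replication API."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self.entries: List[Tuple[int, str]] = []

    def commit(self, engine_txn: str) -> int:
        """Assign the next global transaction ID to `engine_txn`.

        The engine may already have committed this transaction; here
        it is simply appended, so the log is linearly ordered by
        arrival order.
        """
        gtid = next(self._next_id)
        self.entries.append((gtid, engine_txn))
        return gtid


# The engine committed t1, t2, t3 in that order, but they reach the
# replication API as t2, t3, t1.
log = ReplicationLog()
arrival_ids = [log.commit(txn) for txn in ["t2", "t3", "t1"]]
```

After this the log holds (1, t2), (2, t3), (3, t1): a perfectly good
linear order for a slave to replay, but not one you could use to
reconstruct what the engine actually did.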

So those are the requirements I could derive from having NDB use our
to-be-implemented API. My conclusion from the above is that we should
consider adding to the model the concept of a transaction group,
which:
 -> the engine (or MariaDB server, for multi-engine transactions?) MAY
provide information about which transactions were committed within
the same group.
 -> If such information was provided, a redundancy service MAY process
transactions inside a group in parallel or out of order, but MUST make
sure that all transactions in transaction group G1 are
processed/committed before the first transaction in G2 is
processed/committed.
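
As a minimal sketch of the MAY/MUST rule above, assuming Python and a
hypothetical apply_txn callback supplied by the redundancy service
(apply_in_groups is likewise an invented name): transactions inside a
group are handed to a thread pool and may finish in any order, but the
next group is not started until the whole previous group is done.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Sequence


def apply_in_groups(
    groups: Iterable[Sequence[str]],
    apply_txn: Callable[[str], None],
    max_workers: int = 4,
) -> None:
    """Apply transaction groups in order, transactions within a group
    in parallel.

    `groups` is an ordered iterable of transaction groups (G1, G2, ...).
    `apply_txn` is a hypothetical callback that applies one transaction.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for group in groups:
            # Consuming the map() iterator with list() blocks until
            # every transaction in this group has been applied, so it
            # acts as the sync point at the group boundary. It also
            # re-raises any exception from apply_txn.
            list(pool.map(apply_txn, group))
```

Note that the barrier is exactly where the performance concern comes
from: each group takes as long as its slowest transaction.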

> 2) Develop a totally different model to describe NDB use case and have it
> as a different API. Which is exactly what it is right now if I'm not
> mistaken. So that it just falls out of scope of today's topic.

We should not include the NDB internal replication in this discussion.
Or we might in the sense that real world examples can give good ideas
on implementation details and corner cases. But it is not a
requirement. How NDB uses the MySQL row based replication, on the
other hand, is imho an interesting topic to take into account.

henrik

-- 
email: henrik.ingo@xxxxxxxxxxxxx
tel:   +358-40-5697354
www:   www.avoinelama.fi/~hingo
book:  www.openlife.cc
