
maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication


Quoting Henrik Ingo <henrik.ingo@xxxxxxxxxxxxx>:

Meta discussion first, replication discussion below :-)

On Mon, Mar 22, 2010 at 4:41 PM, Alex Yurchenko
<alexey.yurchenko@xxxxxxxxxxxxx> wrote:
Uh, I'm not sure I can accept this proposition. At least it seems
contradictory to MariaDB's vision of being a practical, user and
customer driven, database.

I do understand the desire to marry marketing to software design, but
they are simply unrelated areas of human activity. "Computer science" is
called "science" because there are real laws which no marketing genius can
invalidate. So YMMV.

It is not marketing. Science can produce things with practical value,
and things with little or no practical value. We want to produce
things with practical value.

As I see it, for real world applications, you should always start with
I never suggested to implement a model without connection to use cases,
and I believe I went to sufficient lengths to explain how proposed model
can satisfy a broad range of use cases. What I was saying is that you're
always programming a model, not use cases and therefore anything that you
want to implement must be expressed in terms of the model.

This is true. Skipping the part where you create a model leads to chaos.

In this connection saying that you have a use case that does not need
linearly ordered commits really means nothing. Either you need to propose
another model, live with linearly ordered commits or drop the case. Either
way it has no effect on the design of this model implementation, because
linearly ordered commits IS the model. You cannot throw them out without
breaking the rest of the concept. So much for the usefulness of use cases
in high-level design: some of them fit, some of them don't.

I'm not sure about where Kristian is, but at least my participation is
based on the assumption that we are still exploring the proposed model
to see if we like it or whether we should modify it or have a
different model. This assessment is based on asking which use cases are
served well by the model.

I'm also a fan of abstract thinking though. Sometimes you can get
great innovations from starting with a nice abstract model, and then
ask yourself which real world problems it would (and would not) solve.

And that's exactly what I'm trying to do in this thread - start with a
model, not use cases.

Either way, you end up with anchoring yourself in real world use

Well, when you start with a model, it means that you use it as a reference
stick to accept or reject use cases, doesn't it? So that makes the model
the anchor. And leaves use cases only as a means to see how practical the model

No, this is what I disagree with. You could propose a model that is
sound in a theoretical sense, but useless in practice because it
doesn't serve any use cases that real world users are interested in.
So the use cases are the reference stick to accept or reject the
model. But also the full set of use cases is not set in stone. We can
decide that we like a model because it serves many use cases and then
we reject the use cases not served by it.

And there is another curious property of models: the more abstract the
model is (i.e. the less it is rooted in use cases), the more use cases it can
satisfy. Once you stop designing specifically for asynchronous replication,
you find out that the same scheme works for synchronous too.

True. Abstract thinking sure is a win, there's no question about that.
But universities are also full of those scientists who produce little
of practical value. I worked one year at HUT - it was the most relaxed
job I ever had, there was no requirement to produce anything useful
unless you really wanted to. My master's thesis contributes something new
to the field of eLearning, that nobody had researched before. But if I
had to explain the main points of it in a business world, I could do
so in 60 seconds. The rest is just "scientific fluff".

Good science has practical value (sometimes apparent only after
decades). But not everything that happens in science is good science.

Back on track: So the API should of course implement something which
has as broad applicability as possible. This is the whole point of
questioning you, since now you have just suggested a model which
happens to nicely satisfy Galera's needs :-)

Well, this may seem like it because Galera is the only explicit
implementation of that model. But the truth is Galera is possible only
because this model was explicitly followed. And this model didn't come out
of thin air. It is a result of years of research and experience - not only

Yes. The model certainly looks sound and promising, no question about
that. I think the discussion is more about corner cases.

For example, MySQL|MariaDB is already implementing a large portion of the
proposed model by representing evolution of a database as a _series_ of
atomic changes recorded in a binlog. In fact it had global transaction IDs
from day one. They are just expressed in the way that makes sense only in
the context of a given file on a given server. Had they been recognized as
global transaction IDs, implementing a mapping from a file offset to an
ordinal number is below trivial. Then we would not be having 3rd party
patches applicable only to MySQL 5.0. (Let's face it, global transaction
IDs in master-slave replication are so trivial they are practically built
in.) The reason why there is no nice replication API in MariaDB yet is
that this model was never explicitly recognized. And an API is a description
of a model. You cannot describe what you don't recognize ;)
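
To illustrate the point about mapping a binlog position to an ordinal
number, here is a minimal Python sketch. It is not MariaDB code: the class
name BinlogOrdinalMap, its methods, and the idea of feeding it the commit
offsets of each binlog file are all assumptions made for illustration. It
only shows that once the files are taken in creation order, a (file,
offset) pair and a global sequence number are interchangeable.

```python
from bisect import bisect_left


class BinlogOrdinalMap:
    """Hypothetical index: (binlog file, commit offset) <-> global ordinal."""

    def __init__(self):
        self._files = []    # binlog file names in creation order
        self._offsets = {}  # file name -> sorted commit offsets in that file
        self._base = {}     # file name -> number of commits in earlier files

    def add_file(self, name, commit_offsets):
        """Register the next binlog file together with its commit offsets."""
        base = 0
        if self._files:
            last = self._files[-1]
            base = self._base[last] + len(self._offsets[last])
        self._files.append(name)
        self._offsets[name] = sorted(commit_offsets)
        self._base[name] = base

    def ordinal(self, name, offset):
        """Return the 1-based global ordinal of the commit at (name, offset)."""
        offsets = self._offsets[name]
        i = bisect_left(offsets, offset)
        if i == len(offsets) or offsets[i] != offset:
            raise KeyError((name, offset))
        return self._base[name] + i + 1

    def position(self, ordinal):
        """Return the (file, offset) pair for a 1-based global ordinal."""
        for name in self._files:
            base = self._base[name]
            offsets = self._offsets[name]
            if base < ordinal <= base + len(offsets):
                return name, offsets[ordinal - base - 1]
        raise KeyError(ordinal)
```

With two files holding three and two commits, the commit at the second
offset of the second file gets ordinal 5, regardless of which server or
file name the position was originally expressed in.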


So in reality I am not proposing anything new or specific to Galera. I'm
just suggesting to recognize what you already have there (and proposing the
abstractions to express it).

And imho this joint effort is looking really promising all in all,
since so many experts are exchanging their wisdom. (Not really
counting myself here, although I've read many white papers about
replication :-)

So those are the requirements I could derive from having NDB use our
to-be-implemented API. My conclusion from the above is that we should
consider adding to the model the concept of a transaction group,
 -> the engine (or MariaDB server, for multi-engine transactions?) MAY
provide information of which transactions had been committed within
the same group.
 -> If such information was provided, a redundancy service MAY process
transactions inside a group in parallel or out of order, but MUST make
sure that all transactions in transaction group G1 are
processed/committed before the first transaction in G2 is

Well, that's a pretty cool concept. One way to call it is "controlled
eventual consistency". But does the redundancy service have to know about it?

If the redundancy service does not know about it, how would the
information be transmitted by it??? For instance take the example of
the binlog, which is a redundancy service in this model. If it
supported this information (which it MAY do), it of course has to save
it in some format in the binlog file.
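
The transaction group rule quoted above (transactions inside a group MAY
be processed in parallel or out of order, but a whole group MUST be done
before the next one starts) can be sketched in Python as follows. This is
only an illustration under assumptions: the function name apply_in_groups,
the (group_id, payload) tuple format, and the use of a thread pool are
hypothetical and not part of any proposed API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def apply_in_groups(transactions, apply, max_workers=4):
    """Apply transactions group by group.

    `transactions` is an iterable of (group_id, payload) pairs in commit
    order; consecutive pairs with the same group_id form one group.
    `apply` is a callable invoked once per payload.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _group_id, members in groupby(transactions, key=lambda t: t[0]):
            # Inside one group the order is free, so submit everything.
            futures = [pool.submit(apply, payload) for _, payload in members]
            # Barrier: wait for the whole group before touching the next
            # one. result() also re-raises any error from `apply`.
            for future in futures:
                future.result()
```

If no group information is available, every transaction can simply be
given its own group_id, which degrades to the plain linear order.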

First of all, these groups are just superpositions of individual atomic
transactions. That is, this CAN be implemented on top of the current

Yes, this is the intent.

Secondly, transaction applying is done by the engine, so the engine or the
server HAS to have support for this, both on the master and on the slave
side. So why not keep the redundancy service API free from that at all?
Consider this scheme:

Database Server | Redundancy Service
(database data) | (redundancy information)
       Redundancy API

The task of the redundancy service is to store and provide redundancy
information that can be used in restoring the database to a desired state.
Keeping the information and using it are two different things. The purpose
of an API is to separate one part of the program from the logic of another.
So I'd keep the model and the API as simple and as free from the server
details as possible.
What it means here: redundancy service stores atomic database changes in a
certain order and it guarantees that it will return these changes in the
same order. This is sufficient to restore the database to any state it
It is up to the server in what order it will apply these changes and if it
wants to skip some states. (This assumes that the changesets are opaque to
the redundancy service and the server can include whatever information it
wants in them, including ordering prefixes.)
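
A minimal Python sketch of the contract described here: the service stores
opaque changesets in order and hands them back in the same order, and
everything else is left to the server. The class and method names
(InMemoryRedundancyService, append, read_from) are hypothetical, and a
real service would of course persist or replicate the data instead of
keeping a list in memory.

```python
class InMemoryRedundancyService:
    """Hypothetical redundancy service: an ordered log of opaque changesets."""

    def __init__(self):
        self._log = []

    def append(self, changeset):
        """Store one atomic change; return its 1-based sequence number."""
        # The service never looks inside the bytes it is given.
        self._log.append(bytes(changeset))
        return len(self._log)

    def read_from(self, seqno=1):
        """Yield (seqno, changeset) pairs in the order they were stored."""
        for index in range(seqno - 1, len(self._log)):
            yield index + 1, self._log[index]
```

The server side decides what to do with the stream: apply every
changeset, stop at some sequence number, or interpret its own ordering
prefixes inside the payload.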

Ok, this is an interesting distinction you make.

So in current MySQL/MariaDB, one place where transactions are applied
to a replica is the slave SQL thread. Conceptually I've always thought
of this as "part of replication code". You propose here that this
should be a common module on the MariaDB server side of the API,
rather than part of each redundancy service. I guess this may make

This opens up a new field of questions related to the user interface
of all this. Typically, or "how things are today", a user will
initiate replication/redundancy related events from the side of the
redundancy service. E.g. if I want to set up MySQL statement-based
replication, there is a set of commands to do that. If I want to
recover the database by replaying the binlog file, there is a set of
binlog specific tools to do that. Each redundancy service solves some
problems from its own specific approach, and provides a user interface
for those tasks. So I guess at some point it will be interesting to
see what the command interface to all this will look like and whether
I use something specific to the redundancy service or some general
MariaDB command set to make replication happen.

This replication model will eventually influence the user interface.
So far, in the Galera project, we have postponed user interface changes,
partly because our intention is to be transparent to native MySQL, and
partly because we wanted to get end user requirements for the management
first.

For us, this MariaDB replication project comes at just the right time to
lay the groundwork for replication management syntax.

At least the application of replicated transactions certainly should
not be part of each storage engine. From the engine point of view,
applying a set of replicated transactions should be "just another
transaction". For the engine it should not matter if a transaction
comes from the application, mysqldump, or a redundancy service. (There
may be small details: when the application does a transaction, we need
a new global txn id, but when applying a replicated transaction, the
id is already there.)
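
The small detail in the parentheses can be sketched like this in Python.
It is an assumption-laden illustration, not a design: the class name
Server, the commit method and the replicated_id parameter are
hypothetical, and the rule that a replicated id must be larger than the
last one seen is simply one way to keep the ids in a single linear order.

```python
class Server:
    """Hypothetical commit path shared by local and replicated transactions."""

    def __init__(self):
        self._last_id = 0
        self.committed = []  # (txn_id, changes) in commit order

    def commit(self, changes, replicated_id=None):
        if replicated_id is None:
            # Transaction from the application: allocate a new global id.
            self._last_id += 1
            txn_id = self._last_id
        else:
            # Replicated transaction: the id already exists, just reuse it.
            if replicated_id <= self._last_id:
                raise ValueError("replicated id is not ahead of local state")
            txn_id = replicated_id
            self._last_id = replicated_id
        # From here on the two cases are indistinguishable.
        self.committed.append((txn_id, changes))
        return txn_id
```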

Yes, but no. E.g. Galera replication has this strange need to use
prioritized transactions for applying. The DBMS should have the
responsibility to provide high priority sessions for replication appliers.
