maria-developers team mailing list archive

Thread
Date

Re: Ideas for improving MariaDB/MySQL replication

To: maria-developers@xxxxxxxxxxxxxxxxxxx
From: seppo.jaakola@xxxxxxxxxxxxx
Date: Fri, 29 Jan 2010 14:36:12 +0200
In-reply-to: <C7877764.23359%robert.hodges@continuent.com>
User-agent: Internet Messaging Program (IMP) H3 (4.3)

Thanks Robert, this is comprehensive enough :)

I'll just address the consistency checking requirement here,
as I believe this is quite widely accepted goal as well.

Tungsten uses a special consistency table for passing consistency
checking information and which is treated in a special way in the
replication. This kind of consistency checking can be managed by
replication system configuration and it does not necessarily
require any new calls in the replication API (except the configuration,
which must be there anyway).

Tungsten consistency checking technology works very well, and there
is no need to "fix it" in any way. However, this method is not directly
usable for multi master replication, because the target node(s) may
have committed some transactions not yet seen in the originating
master node, and these new transactions can interfere with the
consistency check query.

To support multi-master topologies, the Replication API could support
consistency checking directly. At first look, it seems that (at least)
two calls would be needed for it:
* replication_prepare_consistency_check(), to start checking transaction
  with consistent snapshot.
* replication_check_consistency(query, result), to run a wanted
  select query in prepared transaction and comparing results in all
  the nodes.

But how to trigger the consistency checking? That's a question, that
does not touch replication API directly, but affects the SQL layer.
There could be a totally new consistency check command or SELECT query
could have a new option for triggering consistency check.
Consistency call could be also passed as comment (for select), like:
SELECT MD5(...) FROM ...; /* replication-consistency-check */

But, I don't want to dive any deeper in this specification at this point.

Just wanted to bring to the attention the possibility of openingconsistency checking in replication API and SQL layers. And thatmulti-master

consistency checking requires special attention.

Cheers, Seppo

--
http://www.codership.com  seppo.jaakola@xxxxxxxxxxxxx

Quoting Robert Hodges <robert.hodges@xxxxxxxxxxxxxx>:

Hi Kristian,

Thanks for kicking this thread off.  I have had a bit of a busy week so it
has taken a while to get around summarizing Continuent thoughts on
improvements.

First of all, we Continuent Tungsten folk have a certain set of problems we
solve with replication.  Here are the key use cases:

1. Making full DBMS copies for high availability and to scale reads
(traditional MySQL use pattern).  For this we need simple, fast replication
of all databases as well as the ability to provision efficiently from full
database copies.

2. Supporting scalable data topologies for SaaS and other complex database
architectures that have multiple services generating and receiving data.
This can include patterns like:
  * Mastering data in two locations and replicating into a single server
  * Multi-master replication across sites
  * Replicating different sets of data from one master to different slaves

3. Replicating heterogeneously between MySQL and other database like Oracle.
This requires the ability to filter and transform data easily.  Another use
case of heterogeneous replication is copying across databases of the same
for application upgrades and migration between database versions.

4. Ensuring full data protection such that data, once committed, are not
lost or corrupted.  This includes replicating [semi-]synchronously to
slaves, performing consistency checks on data, performing point-in-time
restoration of data (e.g., using backups + a change log), etc.

We can do these things for the most part with MySQL replication as it
currently stands by reading binlogs and taking things from there.  (That's
how Tungsten works).  Here are some features that would make it easier to
work with the existing replication implementation:

1.) Synchronous replication.  It's enough if replication slaves can hold up
commit notices on the master.  The MySQL 5.5 features look like a good start
but I have not started the implementation and have therefore not hit the
sharp corners.

2.) CRCs.  CRCs and other built-in features like native consistency checks
seem like the most glaring omission in the current binlog implementation.
It's difficult to ensure correct operation without them, and indeed I'm
confident many MySQL bugs can be traced back to the lack of features to make
data corruption more easily visible.

3.) Self-contained transactions with adequate metadata.  Row replication
does not supply column names or keys which makes SQL generation difficult
especially when there are differences in master and slave schema.   Also,
session variables like FOREIGN_KEY_CHECKS affect DDL but are not embedded in
the transaction where they are used.  Finally, character set support is a
little scary based on my one experience in that area.  You have to read code
to get master lists of character sets; semantics are very unclear.

4.) Global transaction IDs.  We put can these in ourselves but it would be
even better if MySQL generated them in the logs so we can just use the
binlog directly.  Unlike Galera, we don't require sequence numbers to be
without gaps; it's enough if they are comparable and monotonically
increasing.  (I don't even think forbidding gaps is a good idea but Alexey
and I need to discuss that over a beer in Helsinki.)  Also, transactions IDs
need an unambiguous source ID or epoch number encoded in the ID so that you
can detect diverging serialization histories.  This nasty little problem
that can lead to big accidents in the field.

5.) Simple-to-read formats.  The binlog structure has grown pretty hairy
over the years.  I would love to have protobuf messages like what Drizzle is
doing.

6.) Fewer bugs and weirdnesses.  Handling of commits in the binlog has
switched back and forth over the years.  Mark Callaghan has documented some
of the strange XA behavior and persistent holes in the protocol in this
area.  Statement replication support has led to some kludges make you feel
faint when looking at them.  (Example:  temp tables, which basically don't
work.)

In fact, you could summarize 2-6 as making the binlog (whether written to
disk or not) into a consistent "database" that you can move elsewhere and
apply without having to add extra metadata, such as global IDs or table
column names.  Currently we have to regenerate the log which is a huge waste
of resources and also have to depend external information to derive schema
definitions.

Finally, since there is already talk about rewriting replication from
scratch, I would like to point out for the sake of discussion a few things
that the current MySQL replication in my opinion does well.  Any future
system must match them.

1.) Serialized transactions.  MySQL provides a list of transactions in
serial order.  All robust replication solutions this, and MySQL essentially
gives it to you modulo problems with XA and ill-behaved table types like
MyISAM that don't have guaranteed durability.

2.) Fast.  MySQL replication really rips as long as you don't have slow
statements that block application on slaves or don't hit problems like the
infamous InnoDB broken group commit bug (#13669) reported by Peter Zaitsev.

3.) Network protocol.  The network protocol for slaves is an excellent
feature.  It means you can fetch events and build new forms of replication
without adding native code, which is a huge productivity boost.

4.) Robust.  There is no lack of problems with MySQL replication but
realistically any new implementations will have a high bar to function
equally well.  Plugin approaches like that used by Drizzle are very flexible
but they also tend to have a kick-the-can-down-the-road effect in that it's
up to plugins to provide a robust implementation.  This in turn takes a long
time to do well unless plugins cut down the problem size, for example by
omitting statement replication.

5.) Works now.  Starting from something that already works is sometimes the
best approach.  Radically new approaches take a long time to pay off and
more often than not fail to see the light of day.

One final suggestion I would propose is that it might make sense to take a
two-pronged approach to replication as follows:

* Logical replication based on an enhanced form of today's MySQL replication
with substantial clean-up of existing code, simplification/enhancement of
binlog event formats, and other features that we can readily agree upon in
short order.

* Physical replication based on change-sets that can be applied at the level
of storage engine recovery logs.  (Databases like MyISAM with no logs don't
get to play.)  PostgreSQL has this now and it's really great.  This might be
where to fit in the Galera ideas.  Since this approach would be completely
new it could also take a more radical line without wrecking what is already
there.  It would also fill a gap in the MySQL architecture--while there has
been logical replication for many years, there is no physical replication.

Anyway, I look forward to further discussion

Cheers, Robert
(On behalf of Continuent and Tungsten)

On 1/22/10 6:21 AM PST, "Kristian Nielsen" <knielsen@xxxxxxxxxxxxxxx> wrote:

The three companies Continuent, Codership, and Monty Program are planning to
start working on some enhancements to the replication system in MariaDB,
together with anyone interested in joining in.

At this stage, there are no fixed directions for the project, and to do this

in as open a way possible with the maximum community involvementand interest,

we agreed to start with an email discussion on the maria-developers@ mailing
list. So consider it started!

The plan so far is:

1) The parties taking this initiative, MP, Continuent, andCodership, present

their own ideas in this thread on maria-developers@ (and everyone else who
wants to chime in at this stage).

2) Once we have some concrete suggestions as a starting point, weuse this toreach out in a broader way with some blog posts on planetmysql /planetmariadb

to encourage further input and discussions for possible directions of the

project. Eventually we want to end up with a list of the mostimportant goals

and a possible roadmap for replication enhancements.

(It is best to have something concrete as a basis of a broad community
discussion/process).

To start of, here are some points of interest that I collected. Everyone
please chime in with your own additional points, as well as comments and
further details on these one.

Three areas in particular seem to be of high interest in the community
currently (not excluding any other areas):

 - High Availability
    * Most seems to focus on building HA solutions on top of MySQL
      replication, eg. MMM and Tungsten.
    * For this project, seems mainly to be to implement improvements to
      replication that help facilitate improving these on-top HA solutions.
    * Tools to automate (or help automate) failover from a master.

* Better facilities to do initial setup of new slave withoutdowntime, or

      re-sync of an old master or slave that has been outside of the
      replication topology for some period of time.

 - Performance, especially scalability
    * Multi-threaded slave SQL thread.
    * Store the binlog inside a transactional engine (eg. InnoDB) to reduce
      I/O, solve problems like group commit, and simplify crash recovery.

 - More pluggable replication

* Make the replication code and APIs be more suitable forpeople to build

      extra functionality into or on top of the stock MySQL replication.
    * Better documentation of C++ APIs and binlog format.
    * Adding extra information to binlog that can be useful for non-standard
      replication stuff. For example column names (for RBR), checksums.
    * Refactoring the server code to be more modular with APIs more suitable
      for external usage.

* Add support for replication plugins, types to be determined.For example

      binlog filtering plugins?

It is also very important to consider the work that the replication team at

MySQL is doing (and has done). I found a good deal of interestinginformation

about this here:


http://forge.mysql.com/wiki/MySQL_Replication:_Walk-through_of_the_new_5.1_and
_6.0_features)

This describes a number of 6.0/5.4 and preview features that we could merge
and/or contribute to. Here are the highlights that I found:

 - Features included in 6.0/5.4 (which are cancelled I think, but presumably
   this will go in a milestone release):
    * CHANGE MASTER ... IGNORE_SERVER_IDS for better support of circular
      replication.
    * Replication heartbeat.
    * sync_relay_log_info, sync_master_info, sync_relay_log,
relay_log_recovery
      for crash recovery on slave.
    * Binlog Performance Optimization (lock contention improvement).
    * Semi-synchronous Replication, with Pluggable Replication Architecture.
      http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication

 - Feature previews:
    * Parallel slave application: WL#4648
      http://forge.mysql.com/wiki/ReplicationFeatures/ParallelSlave
    * Time-delayed replication: WL#344
      http://forge.mysql.com/wiki/ReplicationFeatures/DelayedReplication
    * Scriptable Replication: WL#4008
      http://forge.mysql.com/wiki/ReplicationFeatures/ScriptableReplication
    * Synchronous Replication.

Drizzle is also doing work on a new replication system. I read through the
series of blog posts that Jay Pipes wrote on this subject. They mostly deal
with how this is designed in terms of the Drizzle server code, and is low on
detail about how the replication will actually work (the only thing I could

really extract was that it is a form of row-based replication). Ifsomeone has

links to information about this that I missed, it could be interesting.

Let the discussion begin!

 - Kristian.

_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp



_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp

Follow ups

Re: Ideas for improving MariaDB/MySQL replication
From: Robert Hodges, 2010-01-29

References

Re: Ideas for improving MariaDB/MySQL replication
From: Robert Hodges, 2010-01-29