maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication

To: Robert Hodges <robert.hodges@xxxxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 15 Mar 2010 13:43:18 +0100
Cc: "maria-developers@xxxxxxxxxxxxxxxxxxx" <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <C7877764.23359%robert.hodges@continuent.com> (Robert Hodges's message of "Thu, 28 Jan 2010 17:18:28 -0800")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)
Robert Hodges <robert.hodges@xxxxxxxxxxxxxx> writes:

> First of all, we Continuent Tungsten folk have a certain set of problems we
> solve with replication.  Here are the key use cases:

> 3. Replicating heterogeneously between MySQL and other databases like Oracle.
> This requires the ability to filter and transform data easily.  Another use
> case of heterogeneous replication is copying across databases of the same
> type for application upgrades and migration between database versions.

Yes, this is quite interesting, and somewhat different from normal
MySQL->MySQL replication.

> 4. Ensuring full data protection such that data, once committed, are not
> lost or corrupted.  This includes replicating [semi-]synchronously to
> slaves, performing consistency checks on data, performing point-in-time
> restoration of data (e.g., using backups + a change log), etc.

And also reliable crash recovery, which I think is not there in 5.1, and which
in 5.5 is implemented in a way that I fear comes with too high a performance
cost for many of the applications that need it the most (too many fsync()s).

> how Tungsten works).  Here are some features that would make it easier to
> work with the existing replication implementation:
>
> 1.) Synchronous replication.  It's enough if replication slaves can hold up
> commit notices on the master.  The MySQL 5.5 features look like a good start
> but I have not started the implementation and have therefore not hit the
> sharp corners. 

As I understand it, synchronous replication based on current 5.5 features
would first start commit on master, then send binlog to slave, then run and
commit transaction on slave, then finish commit on master. So
transactions-per-second rate would be quite limited. But of course there are
many applications where load is light and this would be useful.
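To make the cost concrete, here is a rough back-of-the-envelope sketch of that
commit sequence. The function names and all the timings are made-up
assumptions for illustration, not measurements of 5.5, and the bound only
applies to the extent that commits are serialized behind one another:

```python
# Rough model of the synchronous commit sequence described above.
# All timings are illustrative assumptions, not measurements.

def sync_commit_latency(master_start_ms, network_rtt_ms,
                        slave_apply_ms, master_finish_ms):
    """Latency of one commit when the master waits for the slave to commit."""
    return master_start_ms + network_rtt_ms + slave_apply_ms + master_finish_ms


def max_serial_tps(latency_ms):
    """Upper bound on commits per second if commits are fully serialized."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    latency = sync_commit_latency(1.0, 0.5, 2.0, 0.5)
    print("latency per commit (ms):", latency)
    print("max serialized commits/s:", max_serial_tps(latency))
```

Every step sits on the critical path of the commit, so the slave's apply time
and the network round trip add directly to the latency seen on the master.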

> 2.) CRCs.  CRCs and other built-in features like native consistency checks
> seem like the most glaring omission in the current binlog implementation.
> It's difficult to ensure correct operation without them, and indeed I'm
> confident many MySQL bugs can be traced back to the lack of features to make
> data corruption more easily visible.

Yes.
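As a minimal sketch of the idea (not the actual binlog event format; the
4-byte little-endian trailer and the names frame_event/check_event are
assumptions for illustration), a per-event CRC could look like this:

```python
import struct
import zlib


def frame_event(payload: bytes) -> bytes:
    """Append a CRC32 of the payload as a 4-byte little-endian trailer."""
    return payload + struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)


def check_event(framed: bytes) -> bytes:
    """Return the payload if the trailer matches, else raise ValueError."""
    if len(framed) < 4:
        raise ValueError("event too short to carry a checksum")
    payload, trailer = framed[:-4], framed[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: event is corrupt")
    return payload
```

A reader (slave, backup tool, or external replicator) would call check_event()
on each event before applying it, so corruption is caught where it is read
rather than showing up later as diverging data.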

(If there were a good SQL-level way of matching binlog position/transaction ID
with MVCC snapshot version, consistency checks on tables could be implemented
very well and flexibly from outside the server and replication framework. This
would be a very nice feature to have, though not without problems to
implement...)

> 3.) Self-contained transactions with adequate metadata.  Row replication
> does not supply column names or keys which makes SQL generation difficult
> especially when there are differences in master and slave schema.   Also,

Right... but do you suggest putting the entire table definition of every table
into every transaction? Sounds a bit bloated perhaps?

The row-level binlogging in MySQL is based on column index, not column
name. But I understand that the ability to generate SQL (which is based on
column name rather than index) would be nice.
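To illustrate the gap, here is a small sketch of what an external tool has to
do today: take a row image that is only positional and combine it with column
names obtained from somewhere else. The function names, the table, and the
idea that the names come from a separately maintained schema copy are all
assumptions for the example, not part of the binlog:

```python
def quote_ident(name):
    """Quote an identifier MySQL-style, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def row_to_insert(table, column_names, row_values):
    """Build a parameterized INSERT from a positional row image.

    column_names must come from outside the row event (for example a
    schema copy kept by the tool), since the event carries only values
    in column-index order.
    """
    if len(column_names) != len(row_values):
        raise ValueError("schema and row image disagree on column count")
    cols = ", ".join(quote_ident(c) for c in column_names)
    placeholders = ", ".join(["%s"] * len(row_values))
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        quote_ident(table), cols, placeholders)
    return sql, list(row_values)


if __name__ == "__main__":
    print(row_to_insert("t1", ["id", "name"], (1, "x")))
```

If the external schema copy is out of step with the master (say, a column was
added or reordered), the column count check is the only thing that can notice,
and a pure reordering passes it silently.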

> session variables like FOREIGN_KEY_CHECKS affect DDL but are not embedded in
> the transaction where they are used.  Finally, character set support is a

Can you elaborate? Wouldn't this also cause bugs in MySQL replication itself?

> little scary based on my one experience in that area.  You have to read code
> to get master lists of character sets; semantics are very unclear.

What are the issues with character set?

> and I need to discuss that over a beer in Helsinki.)  Also, transactions IDs
> need an unambiguous source ID or epoch number encoded in the ID so that you
> can detect diverging serialization histories.  This is a nasty little
> problem that can lead to big accidents in the field.

Can you elaborate? I don't understand exactly what a "source ID" or "epoch
number" would be. Can you give an example?

> In fact, you could summarize 2-6 as making the binlog (whether written to
> disk or not) into a consistent "database" that you can move elsewhere and
> apply without having to add extra metadata, such as global IDs or table
> column names.  Currently we have to regenerate the log which is a huge waste
> of resources and also have to depend on external information to derive schema
> definitions. 

> Finally, since there is already talk about rewriting replication from
> scratch, I would like to point out for the sake of discussion a few things
> that the current MySQL replication in my opinion does well.  Any future
> system must match them.

> 4.) Robust.  There is no lack of problems with MySQL replication but
> realistically any new implementations will have a high bar to function
> equally well.  Plugin approaches like that used by Drizzle are very flexible
> but they also tend to have a kick-the-can-down-the-road effect in that it's
> up to plugins to provide a robust implementation.  This in turn takes a long
> time to do well unless plugins cut down the problem size, for example by
> omitting statement replication.

Yes. I think this is a very good point.

E.g., many of your points could be answered merely by making MySQL binlogging
pluggable and letting Tungsten (and everyone else) just implement their own
logging to fit their particular purpose. But there is also a lot to be said
for providing a single really useful binlog implementation. It does not sound
appealing for users to have 3 or 4 different binlogs on their systems, each
supporting a particular plugin (we already have two with the engine's internal
transactional log, which is arguably one too many).

> 2.) Fast.  MySQL replication really rips as long as you don't have slow
> statements that block application on slaves or don't hit problems like the
> infamous InnoDB broken group commit bug (#13669) reported by Peter Zaitsev.

Well, this may be true for in-memory working sets. But if you have a larger
system that does not fit in main memory and is bottlenecked by the performance
of the disk system, the single-threaded slave really hurts. It makes it really
hard to scale up on the disk I/O on the slave. Everybody who is into larger
systems seems to mention this.

> * Logical replication based on an enhanced form of today's MySQL replication
> with substantial clean-up of existing code, simplification/enhancement of
> binlog event formats, and other features that we can readily agree upon in
> short order. 

Yes.

So one could imagine making replication pluggable, and moving the existing
MySQL binlog into a plugin for backwards compatibility. Then we could write
another plugin with an enhanced, non-backward-compatible binlog containing
these enhancements in a more extensible format (e.g. column names in
transactions would hurt no one if they could be easily switched on or off).

(Not sure if current MySQL replication could be suitably extended without a
separate plugin, but if so then so much the better.)
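Very roughly, and purely as a sketch of the shape of the idea (none of these
class or method names exist in the server; they are invented for the example),
the split could look something like this:

```python
class LegacyBinlog:
    """Stand-in for the existing format: row values by column index only."""

    def __init__(self):
        self.events = []

    def log_row(self, table, column_names, values):
        # Column names are deliberately dropped, as in today's row events.
        self.events.append({"table": table, "values": list(values)})


class ExtendedBinlog:
    """Hypothetical enhanced format with switchable column names."""

    def __init__(self, include_column_names=True):
        self.include_column_names = include_column_names
        self.events = []

    def log_row(self, table, column_names, values):
        event = {"table": table, "values": list(values)}
        if self.include_column_names:
            event["columns"] = list(column_names)
        self.events.append(event)


class Server:
    """Hands every committed row change to each registered binlog plugin."""

    def __init__(self, plugins):
        self.plugins = list(plugins)

    def commit_row(self, table, column_names, values):
        for plugin in self.plugins:
            plugin.log_row(table, column_names, values)


if __name__ == "__main__":
    legacy, extended = LegacyBinlog(), ExtendedBinlog()
    Server([legacy, extended]).commit_row("t1", ["id", "name"], (1, "x"))
    print(legacy.events)
    print(extended.events)
```

The point of the sketch is only that the server side stays the same while the
format, and options such as column names, live entirely in the plugin.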

 - Kristian.
References

Re: Ideas for improving MariaDB/MySQL replication
From: Robert Hodges, 2010-01-29