Re: MaxScale as a binlog server
Mark Riddoch <mark.riddoch@xxxxxxxxxx> writes:
> we have been approached by a user with a suggestion for a project that we
> might do with them. They have a replication environment with a large fan out
> (100 slaves per master) and are looking at doing something to cut down the
> load on the master. The thought they had was to use MaxScale to effectively
> cache the bin log files, so MaxScale would act as a single Slave to the
> Master and would itself have a number of Slaves that read the bin log from
> MaxScale. My initial response was why not just use an instance of MariaDB as
> the intermediate node and connect slaves to one of those instances, in a
> tree structure. They had a few reasons why they did not want to do this;
> Looking in more detail I think it is feasible: we essentially write a router
> module for MaxScale that acts as a slave, caching the bin log locally on the
> MaxScale node. We then have another router, or maybe even the same one, that
> relays that binlog to the real slaves. We are looking at doing something for
> MariaDB 10, since we would want to utilise the GTID to get the best failover
> semantics. I am assuming there must be some differences between the
> replication stream that I can find documented for MySQL and what MariaDB 10
> does? Is this stuff anywhere on the Knowledge Base? I tried looking but
Right, so I agree with what Serg said: it may be better to go back to the
original problem and see whether better solutions are possible. However, I
will try to answer the technical aspect here.
I suppose caching and routing binlogs outside of the server should be
possible, but it might be more complex than you would think at first.
I am not aware of any comprehensive documentation for this. However, it should
be relatively easy to see from the code. Basically, it is the function
mysql_binlog_send() in sql/sql_repl.cc that handles sending binlog data to a
slave. There are good comments in there describing many of the trickier
points.
You need to be aware that there is more logic in sending binlog data to a
slave than just streaming raw binlog files. Maybe you can implement a subset
and document any limitations.
The old-style replication (not using GTID) is the simplest, but even it does a
bit more than stream raw files, such as sending some extra events (a
FORMAT_DESCRIPTION_EVENT and a fake ROTATE_EVENT).
There is also the handling of old slaves. A slave sends a value in
@mariadb_slave_capability, and the master will rewrite or remove any events
that the slave does not understand, as appropriate.
And there is the @@skip_replication flag; when the slave sets this, the master
removes events that were logged with @@skip_replication set (to reduce
bandwidth needs).
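To make this concrete, the slave I/O thread issues roughly the following on
the master connection before requesting the binlog dump (the exact capability
value depends on the slave version, and I believe skip_replication is only
set when the slave is configured to filter such events on the master side):

  SET @mariadb_slave_capability=4;  -- 4 should be the GTID-aware capability level
  SET @@skip_replication=1;         -- ask the master to filter skip-marked events

A router would need to recognise these and remember the values per connection.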
The slave also does several SQL queries to obtain additional information. For
example, the slave does a SELECT binlog_gtid_pos(file, offset) to obtain the
GTID position associated with a given old-format replication position, so that
it knows how to switch to GTID mode later. Maybe those queries could be
handled by relaying them back to a real master server.
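For example, the query and a possible result look something like this (the
file name, offset, and returned position are of course just made up here):

  SELECT binlog_gtid_pos('master-bin.000024', 600);
  -- might return '0-1-100,1-2-577'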
The GTID mode is significantly more complex, due to a GTID position having
multiple streams, one per replication domain. The slave sends
@slave_connect_state, @slave_gtid_strict_mode, @slave_gtid_ignore_duplicates,
and optionally @slave_until_gtid values. The router would need to parse these
values and act accordingly.
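As a rough sketch (the GTID values are made up for illustration), a GTID-mode
slave sends something like this before the dump request:

  SET @slave_connect_state='0-1-100,1-2-577';  -- start position, one GTID per domain
  SET @slave_gtid_strict_mode=1;
  SET @slave_gtid_ignore_duplicates=0;
  SET @slave_until_gtid='0-1-200';             -- only sent for START SLAVE UNTIL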
The binlog sender needs to keep track of each domain as the binlog files are
scanned. Events in a domain must be skipped until the point specified in
@slave_connect_state is reached. Then events are sent until the point in
@slave_until_gtid is reached, at which point further events must again be
skipped.
The master also sends extra GTID_LIST events, which contain the state of the
binlog at the point of the event; the slave needs these to correctly handle
START SLAVE UNTIL master_gtid_pos=XXX and MASTER_GTID_WAIT().
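For reference, the user-facing forms of those on the slave look like this
(the GTID and timeout values are just examples):

  START SLAVE UNTIL master_gtid_pos='0-1-200';
  SELECT MASTER_GTID_WAIT('0-1-200', 60);  -- wait up to 60 seconds for the position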
A lot of effort was put into error handling, to support all reasonable, and
even many unreasonable, uses of replication, while still giving an error in
cases where things are obviously wrong (to prevent silently doing the wrong
thing or hanging).
For example, if the slave tries to connect at a GTID that the master does not
have, this is normally an error. This is important, because that GTID might be
some transaction that was executed manually on the slave by mistake; if it
only exists on the slave, then the master would endlessly skip events looking
for the GTID that never shows up, and replication would silently
hang.
However, there are special cases where the slave _is_ allowed to connect at a
GTID that does not exist on the master. For example, one might have a slave
with --log-slave-updates=0. If this slave is promoted as the new master, its
binlogs will not contain the current position, but other slaves are still
allowed to connect to it at the point where it became the new master (though
not at any point earlier than that). Another special case is if the requested
GTID (but no GTIDs following it) was purged from the binary logs, as can
easily happen if a replication domain is unused for a long time.
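For completeness, this is roughly how the other slaves would be re-pointed at
the promoted server (or at a MaxScale node in your scenario); the host and
port are just placeholders, and it is this connect that triggers the GTID
handshake described above:

  CHANGE MASTER TO
      master_host='maxscale1.example.com',  -- placeholder host for the new master/router
      master_port=4406,                     -- placeholder port
      master_use_gtid=slave_pos;
  START SLAVE;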
A _full_ implementation of a router for sending binlog to slaves would be some
effort, but it might be reasonable to implement some subset. I suppose the
first step would be for you to go through the code in mysql_binlog_send() and
the functions it calls, and decide on which parts to implement. I will be
happy to answer any questions that might pop up on the way, of course.
> 2. They wanted to maintain the same group commit groups as the original
> master, so as to benefit from the parallel replication.
Note that this may no longer apply as of MariaDB 10.0.9. The issue is that an
intermediate slave that is itself a master may form smaller commit groups than
the original master, which reduces the opportunity for a third-level slave to
do parallel replication efficiently. But in 10.0.9, it should be possible to
configure the intermediate slave with --binlog-commit-wait-* to get better
group commit.
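The settings in question would be something like the following (the values
are just for illustration and need tuning against the actual workload):

  SET GLOBAL binlog_commit_wait_count=20;    -- wait for up to 20 commits to group together
  SET GLOBAL binlog_commit_wait_usec=50000;  -- but wait at most 50 milliseconds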
> 3. They think that can get better failover semantics because they will have
> n (maybe 5 or 10) MaxScales and they will all have the same bin log, so
> failover between them will be easier.
I am curious how it will be ensured that they all have the same binlog in the
case of a crash? But maybe you just mean that each will be a subset of the
largest one.
Hope this helps,
- Kristian.