maria-developers team mailing list archive

Thread
Date

Re: Ideas for improving MariaDB/MySQL replication

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
Date: Fri, 19 Mar 2010 04:50:05 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <877hp9emrj.fsf@knielsen-hq.org>
Organization: Codership Oy
User-agent: RoundCube Webmail/0.3.1

On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
> Alex, I think this discussion is getting really interesting, and I
> understand
> your points much better now, thanks for your many comments!

Glad to hear that. It'd be very sad otherwise ;)

> 
> Ok, so I think this means that the redundancy service will generate a
new
> RS
> History ID when it starts a new RS History. Exactly when and how this
> happens
> will then depend on the particular plugin.

That's precisely the way to put it: "starts a new RS History". When Galera
node starts with undefined initial state, it generates a new history UUID.
With that, when it will never join a cluster without taking state snapshot
first. Likewise, if it is the first node of the cluster, no other node will
join it without state snapshot. This works remarkably well in preventing
states inconsistency due to operator errors.

>> Well, this is what redundancy service is for: when you crash, you can
>> recover, isn't it? My understanding is that you do want the redundancy
>> service to have the change and that you can take it from there and
commit
>> uncommitted. This is how Galera works and I believe this is how
semi-sync
>> in MySQL works (there's hardly any other way for synchronous
replication
>> to
>> work)
> 
> Aha, now I see what you mean!
> 
> So the redundancy service would provide the durability, using its
> transaction
> log or whatever. This is quite different from current MySQL, where every
> engine, as well as the binlog code, is individually responsible for
> durability, and then consequently the 2-phase commit is needed to ensure
> that
> everyone has the same idea of what should be durable.
> 
> So this is actually quite an appealing idea! With this system, each
storage
> engine (and secondary replication plugins like binlog) would no longer
> need to
> ensure that transactions are durable, just that they are able to recover
> after
> a crash into a consistent state.

Yes, the idea of this model is that the main purpose of redundancy is
durability, which, depending on a plugin, can be of much higher degree than
flush to disk (e.g. binlog to a remote machine with the point-in-time
recovery ability). There is a subtle moment here however: asynchronous
redundancy plugin won't give you that durability. So, again there are
several way to go about that: some applications may be willing to trade
some durability (few last transactions) for speed. Or you can still have an
option to ensure durability on engine level. Maybe something else.

> This could translate concretely into (for example) setting
> innodb_flush_log_at_txn_commit=0 without risking to loose transactions.
So
> no
> fflush() per commit in the engine (and I think therefore no need for
group
> commit).

Exactly. That's our default setting for Galera.

> So only the (primary) redundancy service would need to fflush() (or
> whatever)
> to achieve durability.
> 
> So this is quite interesting! The whole issue of coordinating the state
> between engines, master binlog thread, and slave threads, in the case of
> crash
> recovery, is quite a mess in current MySQL, and this indeed sounds much
> cleaner.
> 
> So let me list some potential issues with this that spring to my mind,
and
> see
> how they could be resolved:
> 
> 
> 1. On crash recovery, the engine must be able to tell where in the RS
> History
> they have recovered themselves to, so that they can correctly be
supplied
> with
> the remaining transactions that they must apply.
> 
> The best way to do this is probably to extend the storage engine API
(and
> the
> engines) with a facility for this, so that the engines can log it in
their
> transaction log. However, it seems simple enough to emulate it for
engines
> that do not implement such extensions. It should be sufficient to create
a
> hidden table inside the engine with a single "transaction_id" column,
and
> just
> insert a new row with the ID just before committing in the engine.
>
> Now, engine commit must be expected to fail from time to time. So it
seems
> hard to supply the engine with the global transaction ID (as that will
not
> be
> generated until engine commit_prepare has succeeded).
> 
> But we don't actually need the _global_, linearly ordered ID for this.
It
> should be sufficient with the local transaction ID generated at the
start
> of
> the transaction, as long as the redundancy service records the mapping
> between
> this and the global transaction IDs. And of course if we choose to
rollback
> later then there will be no matching global transaction ID (which is
fine).

This is an interesting option indeed. However
1) the mapping itself should be durable, so every plugin must design the
way to recover it in the case of crash.
2) global transaction ID better be integral part of the server state,
otherwise it will complicate state snapshot transfer. E.g. it will be
insufficient to just use mysqldump or copy db files, you'll need to carry
global transaction ID along.

The above issues are of course resolvable one way or another. It is just
not obvious what is easier: fix the engine or implement a workaround (if
you consider it in all entirety)

This does not end here. If we want flexibility to define Redundancy Sets
(like limiting the scope of replication to certain tables or have several
RS's per server) we'll need to be able to maintain several global
transaction IDs and map them to RS somehow, as well as store the
description of Redundancy Sets themselves. Probably in a table like
(rs_name, table_list, plugin_stack, history_uuid, global_trx_ID). The
simplest application: graceful server restart. (Notice that since each RS
has a separate plugin stack, there are no complications on plugin side)

The above consideration removes the global transaction ID per engine, but
brings in a separate table that should implicitly participate in every
transaction. I see too many variants here which I can't really assess. I'd
like to hear what DB guys would say.

>>> But true, it is essential that storage engine(s) and redundancy
>> service(s)
>>> commit in the same order. It is however not yet clear to me that
asking
>>> redundancy service first is sufficient for this, nor that it is
>> necessary.
>>
>> As far as I understand, any (semi)synchronous replication requires
that.
> 
> Right ... I have to study (semi)synchronous replication techniques a bit
> more,
> I'm not too familiar with them.

Ok, I might be too rush here. (Semi)synchronous guarantee is that you
don't send OK to client until the change is replicated to (at least some)
other nodes. So technically you can do replication after local commit.
Still there are 3 pros to call redundancy service before commit as I
mentioned before:
1) it is required for consistent ordering in multi-master case
2) it gives you durability with synchronous plugins
3) if you go asynchronous replication, why not start doing it really
asynchronously, that is ASAP?

Note, that if you call redundancy service last, you still have to hold
commit_order_lock for the duration of the call, so you don't win anything
this way.

>> Hm. I'm not well versed in MySQL engine interfaces, so I might be
missing
>> something here. But why can't we have something like this instead:
>>
>>    engine->txn_prepare(thd);
>>
>>    if (redundancy_service->pre_commit(thd) == SUCCESS)
>>    {
>>       engine->txn_commit(thd);
>>       redundancy_service->post_commit(thd);
>>    }
>>    else
>>    {
>>       engine->txn_rollback(thd);
>>    }
>>
>> pre_commit() call locks commit_order_lock in the right order.
>> post_commit() releases commit_order_lock and the resources that might
be
>> allocated by transaction in the redundancy service. Or, if txn_commit()
>> is
>> to heavy an operation, then maybe like that?:
>>
>>    engine->txn_prepare(thd);
>>
>>    if (redundancy_service->pre_commit(thd) == SUCCESS)
>>    {
>>       engine->txn_fix_order(thd);
>>       redundancy_service->post_commit(thd);
>>       engine->txn_commit(thd);
>>    }
>>    else
>>    {
>>       engine->txn_rollback(thd);
>>    }
> 
> There are (at least) two issues of interest:
> 
> 1. Can engines commit in different order from redundancy service (and if
> they
> can, does it matter)
> 
> 2. Can we support group commit (or obtain equivalent performance in some
> other
> way)?
> 
> To give durability, the redundancy service will typically need to
fflush()
> when it commits and generates the global transaction ID. So having group
> commit means that while such fflush() for transaction T1 is running,
other
> transactions T2, T3, ... Tn are free to start (though not complete of
> course)
> committing in the redundancy service. So that they can queue up, and
once
> T1
> fflush() is done, a single fflush() will be sufficient to commit all of
T2,
> ... Tn.
> 
> So this means that T2 must be able to complete engine->txn_prepare()
while
> T1
> is still running inside redundancy_service->pre_commit(). There cannot
be a
> global lock around all of the above code.
> 
> So this in turn means that T3 will be able to complete
> engine->txn_prepare()
> _before_ T2, even though T3 will commit _after_ T2 (at least in terms of
> redundancy service).
> 
> So the question is if T3 can complete engine->txn_prepare() before T2,
yet
> complete txn_commit() after T2. I do not see why not, but I do not know
if
> this is actually the case, I will need to check.

Hm, how is it different from how it is done currently in MariaDB? Does
txn_commit() have to follow the same order as txn_prepare()? If not, then
the commit ordering imposed by redundancy service should not be a problem.

> By the way, is it even necessary to have
> redundancy_service->post_commit()? It
> seems to me that it probably is not needed?
> 
> (Though something like pre_commit() and post_commit() will probably be
> needed
> to support the SQL-level 2-phase commit/XA. This is the SQL syntax the
user
> can use to do two-phase commit between multiple systems used in the
> application).
> 
During local transaction execution some resources will be inevitably
allocated for transaction by redundancy service (row modifications, etc.),
there might be some buffers shared between the engine and redundancy
service (opaque to redundancy service indeed), and finally, in the above
code redundancy_service->pre_commit() locks commit_order_lock.
redundancy_service->post_commit() purpose is to release those resources and
the lock.

Moreover you will need redundancy_service->post_rollback() to release
resources in case of transaction rollback - didn't show it in pseudocode
for simplicity.

Regards,
Alex

-- 
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Follow ups

Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-28
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-19

References

Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-01-22
Re: Ideas for improving MariaDB/MySQL replication
From: MARK CALLAGHAN, 2010-01-24
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-01-27
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-15
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-03-16
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-17
Re: Ideas for improving MariaDB/MySQL replication
From: Alex Yurchenko, 2010-03-17
Re: Ideas for improving MariaDB/MySQL replication
From: Kristian Nielsen, 2010-03-18