maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx>
Date: Mon, 29 Mar 2010 20:03:44 +0300
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <87aats14wu.fsf@knielsen-hq.org>
Organization: Codership Oy
User-agent: RoundCube Webmail/0.3.1

On Mon, 29 Mar 2010 00:02:09 +0200, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
> Alex Yurchenko <alexey.yurchenko@xxxxxxxxxxxxx> writes:
> 
>> On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen
>> <knielsen@xxxxxxxxxxxxxxx> wrote:
> 
>> Hm, how is it different from how it is done currently in MariaDB? Does
>> txn_commit() have to follow the same order as txn_prepare()? If not, then
>> the commit ordering imposed by redundancy service should not be a
>> problem.
> 
> Ok, I checked, and indeed there is no requirement that prepare is done in
> same order as commit.
> 
> In fact, there seems to be no requirement on ordering of commit among
> different engines and binlog at all in the server itself!
> 
> (Since the XA/2pc in MySQL assumes every engine ensures durability by
> itself, there is no requirement for any ordering. In case of a crash, each
> engine will be able to recover every transaction successfully prepared, so
> it is just a matter of deciding which of them to commit and which to
> rollback.)
> 
> So agree, there is no problem with the redundancy service imposing some
> order, with the purpose of enabling recovery even without durability
> guarantee by each individual engine.
> 
> ----
> 
> Now, InnoDB _does_ have a requirement to commit in the same order as the
> binlog (due to InnoBackup; if not same commit order, the snapshot made by
> the backup may not correspond to any position in the binlog, which breaks
> restore).
> 
> The way this is implemented in InnoDB is by taking a global mutex in InnoDB
> prepare(), which is only released in InnoDB commit().
> 
> This is a really bad way to do things :-(. It means that only one (InnoDB)
> transaction can be running the code between prepare() and commit(). Since
> this is where the binlog is written (and the point where the redundancy
> service makes the transaction durable in our discussions), this makes it
> impossible to do group commit!

The way I understood the above is that the global mutex is taken in InnoDB
prepare() solely to synchronize binlog and InnoDB commits. Is that so? If
it is, then it is precisely the thing we want to achieve, but instead of
locking a global mutex in InnoDB prepare() we'll be doing it in
redundancy_service->pre_commit() as discussed earlier:

innodb->prepare();

if (redundancy_service->pre_commit() == SUCCESS) // locks commit_order mtx
{
    innodb->commit();
    redundancy_service->post_commit(); // unlocks commit_order mtx
}
...

This way the global lock in innodb->prepare() can be naturally removed
without any additional provisions. Am I missing something?

On the other hand, if we can reduce the amount of commit ordering
operations to the absolute minimum, as you suggest below, it would only
benefit performance. I'm just not sure about names. Essentially this means
splitting commit() into 2 parts: the one that absolutely must be run under
commit_order mutex protection and another that can be run outside of the
critical section. I guess in that setup all actual IO can easily go into
the 2nd part.
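
To make the split a bit more concrete, here is a rough sketch of what I have
in mind. The names (Engine, commit_ordered, commit_finish, commit_trx) are
made up for illustration and are not existing server or InnoDB API; the
vectors merely stand in for whatever the engine would really do in each part.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical engine with commit() split into two parts (names are assumptions).
struct Engine {
    std::vector<int> order;     // trx ids in the order their commit was fixed
    std::vector<int> finished;  // trx ids whose second part has completed
    std::mutex finish_mtx;      // protects only the bookkeeping in commit_finish()

    // Part 1: cheap, must run under the commit_order mutex.
    void commit_ordered(int trx_id) { order.push_back(trx_id); }

    // Part 2: the potentially slow part (IO), run outside the critical section.
    void commit_finish(int trx_id) {
        std::lock_guard<std::mutex> guard(finish_mtx);
        finished.push_back(trx_id);
    }
};

std::mutex commit_order_mtx;    // stands in for the commit_order mutex

void commit_trx(Engine& engine, int trx_id) {
    {
        std::lock_guard<std::mutex> guard(commit_order_mtx);
        engine.commit_ordered(trx_id);  // only the ordering step is serialized
    }
    engine.commit_finish(trx_id);       // runs after the mutex is released
}
```

With this shape the critical section covers only the ordering step, so several
transactions could be in commit_finish() at once.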

> Again, I think a good solution to this is to have an (optional) storage
> engine callback fix_commit_order(). This will be called after successful
> prepare(), but before commit(). It should do the minimum amount of work
> necessary to make sure that transactions are committed in the order that
> fix_commit_order() is called. The upper layer (/redundancy service) will
> call fix_commit_order() for all transaction participants under a global
> mutex, ensuring correct order.
> 
>     lock(global_commit_order_mutex)
>     fix_binlog_or_redundancy_service_commit_order()
>     for (each storage engine)
>         engine->fix_commit_order()
>     unlock(global_commit_order_mutex)
> 
> (If same commit order is not needed, the fix_commit_order() can be NULL,
> and if all fix_commit_order() are NULL there is no need to take the mutex).

What I'd like to correct here is that ordering is needed at least in the
redundancy service. You need a global trx ID. And I believe storage engines
won't be able to do without it either - otherwise we'll need to deal with
holes in the commit sequence during recovery. Also, I'd suggest moving the
global_commit_order_mutex into what goes by
"fix_binlog_or_redundancy_service_commit_order()" (the name is misleading -
the redundancy service determines the order, it does not have to fix it) in
the above pseudocode. Locking it outside may seriously reduce concurrency.
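
One possible reading of this suggestion, as a sketch only: the redundancy
service owns the mutex, hands out the global trx ID under it, and calls the
optional engine callbacks before unlocking. RedundancyService, Participant and
order_commit() are hypothetical names, and an empty std::function plays the
role of a NULL fix_commit_order().

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical transaction participant; an empty callback mirrors NULL.
struct Participant {
    std::function<void(std::uint64_t)> fix_commit_order;
};

// Hypothetical redundancy service that owns the commit order mutex itself.
class RedundancyService {
public:
    // Determines the order: assigns the next global trx ID under the internal
    // mutex and lets participants that care record their position.
    std::uint64_t order_commit(const std::vector<Participant*>& participants) {
        std::lock_guard<std::mutex> guard(commit_order_mtx_);
        const std::uint64_t seqno = ++last_seqno_;
        for (Participant* p : participants) {
            if (p->fix_commit_order) {
                p->fix_commit_order(seqno);
            }
        }
        return seqno;
    }

private:
    std::mutex commit_order_mtx_;
    std::uint64_t last_seqno_ = 0;
};
```

The caller never sees the mutex, so nothing outside the service can extend
the critical section by accident.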

> Then InnoDB does not need to hold a global mutex across prepare() /
> commit(). In fact all it needs to do in fix_commit_order() is to put the
> transaction into a sorted list of pending commits. Then each transaction
> in commit() needs only wait until it is first in this list, which is
> _much_ better than hanging in prepare() waiting for _all_ transactions to
> commit!
> 
> (There are other implementation possible also, of course).
> 
>  - Kristian.
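
For what it's worth, the sorted list of pending commits could look roughly
like the sketch below. This is only an illustration under my own assumptions:
PendingCommits and its methods are invented names, transactions are identified
by the global trx ID, and enqueue() is assumed to be called from
fix_commit_order() in commit order.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical sorted list of pending commits, keyed by global trx ID.
class PendingCommits {
public:
    // Called from fix_commit_order(): cheap, only records the position.
    void enqueue(std::uint64_t seqno) {
        std::lock_guard<std::mutex> guard(mtx_);
        pending_.insert(seqno);
    }

    // Called at the start of commit(): block until this trx heads the list.
    void wait_for_turn(std::uint64_t seqno) {
        std::unique_lock<std::mutex> lock(mtx_);
        cond_.wait(lock, [&] {
            return !pending_.empty() && *pending_.begin() == seqno;
        });
    }

    // Called at the end of commit(): remove the trx and wake up the waiters.
    void done(std::uint64_t seqno) {
        {
            std::lock_guard<std::mutex> guard(mtx_);
            pending_.erase(seqno);
            committed_.push_back(seqno);
        }
        cond_.notify_all();
    }

    // Order in which transactions actually committed.
    std::vector<std::uint64_t> committed() {
        std::lock_guard<std::mutex> guard(mtx_);
        return committed_;
    }

private:
    std::mutex mtx_;
    std::condition_variable cond_;
    std::set<std::uint64_t> pending_;       // sorted, smallest trx ID first
    std::vector<std::uint64_t> committed_;
};
```

A transaction waits only for those ordered before it, not for every
transaction in the system.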

Regards,
Alex
-- 
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
