maria-developers team mailing list archive

Re: understanding commit and prepare APIs in MariaDB 5.5

Kristian, thank you for the detailed reply. My MariaDB 5.5 questions
have been answered. So in this email I will move on and explain a bit
about what TokuDB does during commit. In another email, I will
continue the discussion on 10.0.

Here is a very high level overview.

In TokuDB, all work to modify a table is done by sending messages into
a tree, and those messages trickle down through buffers. A nice explanation
is in http://www.tokutek.com/wp-content/uploads/2012/11/Fractal-Tree-Technology-and-the-Art-of-Indexing.pdf,
the key part beginning with the slide "Now, What's a B-Tree? & a
Fractal Tree Index".

Additionally, checkpoints occur in two phases. The first phase is
short: it locks out many client operations and sweeps through memory,
marking all the data as needing a checkpoint. This defines the state
of the checkpoint, so no new work may be done on any data in memory
until we can ensure that its checkpointed state will be saved to
disk. The second phase is long and does the work of writing the dirty
data to disk.
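
A minimal sketch of the two phases, with invented names (the real
code must also handle new work arriving on marked nodes, by first
saving the frozen version):

    // Sketch of the two-phase checkpoint described above. All names
    // are illustrative, not the actual TokuDB internals.
    #include <shared_mutex>
    #include <vector>

    struct Node { bool dirty = false; bool pending_checkpoint = false; };

    std::vector<Node*> in_memory_nodes;
    std::shared_mutex checkpoint_lock;  // held exclusively by begin checkpoint

    void write_node_to_disk(Node *) { /* elided */ }

    // Short phase: clients are locked out while we mark the checkpoint state.
    void begin_checkpoint() {
        std::unique_lock<std::shared_mutex> guard(checkpoint_lock);
        for (Node *n : in_memory_nodes)
            if (n->dirty)
                n->pending_checkpoint = true;   // freeze this version
    }

    // Long phase: write the frozen versions out, concurrently with new work.
    void end_checkpoint() {
        for (Node *n : in_memory_nodes)
            if (n->pending_checkpoint) {
                write_node_to_disk(n);
                n->pending_checkpoint = false;
            }
    }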

On commit we do the following (a code sketch follows the list):
 - grab a read lock to prevent the short phase of checkpoint from
starting, called "begin checkpoint".
 - log that the transaction has committed in our recovery log.
 - for every write done provisionally under this transaction, send a
message into the dictionary that states the transaction has committed.
This is a simplification and we have some optimizations, but
algorithmically, this is accurate. This is done so when the message
makes it to the appropriate entry in a leaf node, the system knows the
value that was once provisionally inserted with this transaction is
now committed. We do a similar thing with aborts to let the system
know that the value should be rolled back.
 - remove the transaction from our list of live transactions.
 - unlock whatever is necessary.
 - release the read lock that prevents begin checkpoint.
 - fsync the log.
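
The same steps as code, continuing the invented names from the
checkpoint sketch above (a simplification, like the prose):

    #include <cstdint>
    #include <vector>

    struct Write { /* which tree, which key, ... */ };
    struct Txn { uint64_t id; std::vector<Write> writes; };

    // Invented helpers, bodies elided:
    void log_commit_record(uint64_t) { /* elided */ }
    void send_commit_message(const Write &, uint64_t) { /* elided */ }
    void remove_from_live_transactions(Txn *) { /* elided */ }
    void release_locks(Txn *) { /* elided */ }
    void fsync_log() { /* elided */ }

    void commit_txn(Txn *txn) {
        checkpoint_lock.lock_shared();       // keep "begin checkpoint" out
        log_commit_record(txn->id);          // commit record in recovery log
        for (Write &w : txn->writes)
            send_commit_message(w, txn->id); // provisional values become
                                             // committed as this trickles down
        remove_from_live_transactions(txn);
        release_locks(txn);
        checkpoint_lock.unlock_shared();     // begin checkpoint may run again
        fsync_log();                         // make the commit durable
    }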

So I have been thinking about what it would take to implement
commit_ordered, and while it is likely possible (because I believe
anything is possible given the motivation), I don't see an elegant
way. In the steps above, everything that happens between grabbing and
releasing the begin-checkpoint read lock has to stay there. We cannot
release the begin-checkpoint read lock in between, or we may have
bugs during recovery. Nor can we do all of this work in
commit_ordered, because the stage that sends messages for every write
may be expensive, thereby hurting concurrency. So, for such a thing
to work, we would have to find a way to grab the read lock in
commit_ordered once for each transaction (and because the lock is
fair, we can't just re-grab the lock on the same thread), write to
the recovery log, then perform everything else under commit, then
release the read lock. It can probably be done, but it is messy. If
unnecessary, I prefer not to do it.
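
Concretely, the split would have to look something like this (a
hypothetical sketch reusing the names above; note the read lock is
taken in the group-commit thread but released in the client thread,
which our fair lock does not support):

    // Hypothetical split across commit_ordered and commit. The messy part:
    // checkpoint_lock is acquired here, in the group-commit thread...
    void commit_ordered_txn(Txn *txn) {
        checkpoint_lock.lock_shared();
        log_commit_record(txn->id);     // fixes the commit order in the log
    }

    // ...and released here, in the client thread.
    void commit_txn_rest(Txn *txn) {
        for (Write &w : txn->writes)
            send_commit_message(w, txn->id);   // the expensive stage stays
                                               // out of commit_ordered
        remove_from_live_transactions(txn);
        release_locks(txn);
        checkpoint_lock.unlock_shared();
        fsync_log();
    }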

On Thu, Feb 21, 2013 at 5:42 AM, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
> Zardosht Kasheff <zardosht@xxxxxxxxx> writes:
>
>> I am investigating what our storage engine, TokuDB, needs to do to
>> achieve its best performance in MariaDB 5.5 with respect to preparing
>> and committing transactions, and with respect to binary logging.
>
> Great!
>
>>  - In MariaDB 5.5, we must fsync our recovery log on
>> handlerton->prepare and handlerton->commit
>
> Yes. Without this, binlog and engine may become out-of-sync with each other
> after a crash.
>
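
A sketch of the shape this takes in an engine (the prepare/commit
signatures follow the 5.5 handlerton; the helpers and this code are
invented, and it only compiles inside the server source):

    /* Sketch: a 5.5 engine makes its recovery log durable in both
       prepare and commit. txn_of() and the log helpers are invented. */
    static int engine_prepare(handlerton *hton, THD *thd, bool all) {
        log_prepare_record(txn_of(thd, all));
        fsync_log();    /* durable before the binlog is written */
        return 0;
    }

    static int engine_commit(handlerton *hton, THD *thd, bool all) {
        log_commit_record(txn_of(thd, all));
        fsync_log();    /* in 5.5, required on every commit */
        return 0;
    }
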
>>  - If we want to ensure that the order we commit transactions is the
>> same order that they appear committed in the binary log, then we
>> should perform our commit in two phases, commit_ordered and commit,
>> where commit_ordered is serialized under a global mutex that ensures
>> the correct order
>
> Yes, I think you have the right picture. A detail is that serialisation is
> achieved by running commit_ordered() for many transactions in a simple loop in
> one thread. The end result is much the same, but the single-thread approach
> performs significantly better (I remember blogging about this at some point).
>
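
In code terms, the loop is conceptually like the following (a
simplified sketch, not the actual server code; engines_used_by() and
the queue handling are invented, and it assumes the usual server
types THD and handlerton):

    /* The group-commit leader walks the queued transactions in binlog
       commit order, calling each engine's commit_ordered() itself. */
    void run_commit_ordered(THD **queue, size_t count) {
        for (size_t i = 0; i < count; i++)
            for (handlerton *hton : engines_used_by(queue[i]))
                if (hton->commit_ordered)
                    hton->commit_ordered(hton, queue[i], true);
    }
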
>> Here is my big question: given TokuDB is not designed to work with
>> InnoDB's hot backup solution, is there ANY reason why we should care
>> about the commit order of our transactions? Are there any performance
>> reasons related to group commit?
>
> Well, I think the answer you are looking for is "no". I took a lot of care
> when designing this stuff to preserve backwards compatibility in the API, and
> to make the commit_ordered extensions optional. Things should work fine (and
> if not let us know and we will fix it).
>
> One reason to implement it is this:
>
>     https://kb.askmonty.org/en/enhancements-for-start-transaction-with-consistent/
>
> For example, MariaDB 5.5 has completely non-blocking mysqldump --master-data
> backup, but only if the storage engine implements commit_ordered(), similar to
> InnoDB hot backup. But yes, a storage engine may not care about this; that is
> fine.
>
> On the other hand, if there is any reason for TokuDB _not_ to implement
> commit_ordered, I would like to know, then maybe I could fix it. After working
> and thinking a lot about these issues, I have realised that there are a lot of
> fundamental benefits to having synchronised commit order between binlog and
> engines. I think you are right that in 5.5, implementing commit_ordered gives
> only small performance improvement, if any. But in future versions there might
> be significant performance gains from implementing commit_ordered, as well as
> usability gains.
>
> Maybe you could give me a short overview of how commit works in TokuDB? I
> imagine most transactional engines have some sort of serialisation around
> commit to reserve a slot in the transaction log or whatever (or does TokuDB
> have multiple independent redo logs?). If such serialisation is done in the
> commit_ordered() call, then it can run in the same thread for multiple
> transactions, which reduces the need for context switches.
>
> So the intention was that it should be trivial for a transactional engine to
> support commit_ordered(). I modified both PBXT and InnoDB for this, and it was
> just a few lines of code to split out the relevant parts of the existing
> commit() method. So if TokuDB is harder, I would like to know.
>
> Of course, there is not much I can do about MySQL 5.6 deciding to implement a
> very similar algorithm with a completely incompatible API, causing much grief
> for 3rd party storage engines :-(
>
>> From everything I read, I think the answer is no, but I would
>> appreciate confirmation. I think there are no performance gains to be
>> made by implementing commit_ordered, and given we fsync on commit,
>> there should be no issues with syncing our engine up with the binary
>> log after a crash.
>
> Yes, unless you can gain a bit of performance at high concurrency by
> piggybacking your own serialisation needs on the commit_ordered() thread,
> thereby avoiding some context switches. Which is by no means certain.
>
>> Also, I assume the fsync after commit is necessary. Only in 10.0 will
>
> Yes, at least theoretically. It is only necessary to fsync the very last
> commit in the binlog before binlog rotation. However, with no way to know when
> such rotation is about to occur, you are left with no option but to fsync
> always :-(
>
>
>> some new functionality be available that allows us to not fsync on
>> commit.
>
> Yes, so this is a huge performance gain of course. And to use this in 10.0,
> you will need to implement commit_ordered(). Let me explain why, maybe this
> will be useful to you.
>
> So if we crash, we have to do binlog recovery at next startup to be sure that
> we have the exact same set of committed transactions in binlog and in
> engines. To do this we scan the binlog, commit any prepared transactions in
> engines that are in the binlog, and roll back the rest.
>
> Now the issue is - how far back in the binlog should we look? We cannot keep
> binlogs around forever and scan them all at every crash, after all.
>
> So we need some way to know from a storage engine when a commit has been
> fsync()ed to disk and become durable. At this point, we know we no longer need
> to scan for that particular commit in the binlog.
>
> The method used in 5.5 and below is horribly crude: when commit() returns, we
> know that this particular commit has become durable.
>
> In MariaDB 10.0, this is refined. When we rotate the binlog, we ask the
> storage engine to notify us when all commits in the old binlog file have
> become durable. When they have, we no longer need to scan the old binlog file
> at next crash recovery.
>
> By assuming consistent commit order, this is easy. We just call
> commit_checkpoint_request(), storage engine waits until the latest commit has
> become durable, then replies with commit_checkpoint_notify_ha(). I did not
> want to add the overhead of such notification for _every_ commit, only for the
> occasional binlog rotation. Since commits are ordered, we need only ask for
> the last commit in the binlog, then we know all the prior commits are durable
> as well.
>
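
On the engine side this could look roughly like the following (the
two commit_checkpoint_* names are the real 10.0 API; everything else
is an invented placeholder):

    /* Rough sketch: the server hands the engine an opaque cookie; the
       engine calls back once every commit up to this point is durable. */
    #include <cstdint>
    #include <functional>

    uint64_t last_commit_lsn();                                     // invented
    void notify_when_durable(uint64_t, std::function<void()>);      // invented

    static void engine_checkpoint_request(handlerton *hton, void *cookie) {
        uint64_t target = last_commit_lsn();    /* latest commit */
        /* Arrange for a background thread to wait until the recovery log
           is fsync()ed past target, then notify the server: */
        notify_when_durable(target, [hton, cookie] {
            commit_checkpoint_notify_ha(hton, cookie);
        });
    }
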
> So if you want to benefit from no fsync in commit() method in MariaDB 10.0,
> then you need to implement commit_ordered(). And then you might as well do it
> for 5.5 also, the commit_ordered() API is the same.
>
> If this is a problem for you, let me know. Now that I think about it, I
> believe we can perhaps relax the requirement of commit_checkpoint_request() to
> be that all currently prepared transactions have become durably committed,
> then there is no need to implement commit_ordered.
>
> But the point is, when implementing new performance improvements, we will
> generally assume that commit_ordered is implemented, and not spend effort on
> older engines with no commit_ordered() (except to ensure they still work).
>
>> Is my understanding correct?
>
> Basically yes. Which makes me happy, as that means my existing communications
> on this subject have at least been somewhat adequate :-)
>
> Hopefully my additional details were useful to you as well.
>
>> Thanks
>> -Zardosht
>
> BTW, thanks for taking up these discussions, I really appreciate your thoughts
> and feedback.
>
>  - Kristian.

