maria-developers team mailing list archive

Re: understanding commit and prepare APIs in MariaDB 5.5

 

Zardosht Kasheff <zardosht@xxxxxxxxx> writes:

> Reading the email, I think this is what is happening. You depend on
> commit_ordered to order the transactions in the engine, and when the
> binary log is going to rotate, you call commit_checkpoint_request on
> the last transaction in that order. When that returns, we know all
> transactions in the binlog have been committed to disk and the binary
> log may be rotated.
>
> Is this accurate?

Close, but not quite.

We do not wait for anything before rotating the binlog, as that would
unnecessarily stall subsequent commits. But we do ask the storage engines to
let us know when all transactions in the previous log file have been durably
committed. Until then, we need to scan two binlog files in case of crash
recovery, the old one and the new one. Once the storage engines tell us that
everything is durable, we write a marker in the new log that the old log is no
longer needed.
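
As a rough illustration of the paragraph above, here is a toy model of which
binlog files crash recovery would have to scan. It is only a sketch: the class
BinlogScanSet and its methods are hypothetical names invented for this example
and are not part of the MariaDB source. It assumes that rotation simply adds
the new file to the set, and that the old file is dropped only when the
engines report durability.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical model, not MariaDB code: tracks the binlog files that crash
// recovery would still need to scan.
class BinlogScanSet {
public:
  explicit BinlogScanSet(std::string first_file) {
    files_.push_back(std::move(first_file));
  }

  // Rotation does not wait for the engines: the old file stays in the set,
  // so recovery would scan both the old and the new file for now.
  void rotate(std::string new_file) {
    assert(!new_file.empty());
    files_.push_back(std::move(new_file));
  }

  // The engines reported that everything in old_file (and earlier files) is
  // durable. This models writing the "old log no longer needed" marker.
  // The current (last) file is never dropped.
  void engines_durable_through(const std::string &old_file) {
    for (std::size_t i = 0; i + 1 < files_.size(); ++i) {
      if (files_[i] == old_file) {
        files_.erase(files_.begin(),
                     files_.begin() + static_cast<std::ptrdiff_t>(i + 1));
        return;
      }
    }
  }

  const std::deque<std::string> &files_to_scan() const { return files_; }

private:
  std::deque<std::string> files_;
};
```

In this model, rotate() returns immediately and the scan set only shrinks
later, when engines_durable_through() is called for the old file.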

The implementation and API are quite asynchronous in this respect.

> If so, then perhaps the ordering is adding an unnecessary constraint.

Yes, I think you are right. You have to understand, when I implemented this, I
did not really worry about storage engines that do not implement
commit_ordered(), because the intention is that all up-to-date engines will
want to do this anyway. So it looks easy to make this particular feature work
without commit_ordered(), I just did not consider it before.

> How would the following work:
>  - when the binary log is to be rotated, wait for all transactions
> that are in the process of committing to commit.

I do not want to do this, as it introduces unnecessary stalling.

>  - call each handlerton to ensure all committed transactions are
> durable. For TokuDB, this would mean fsyncing our recovery log. In

We can still do this.

The contract around commit_checkpoint_request() is that the storage engine must
not reply until all transactions that have returned from commit_ordered() have
become durable. If you do not implement commit_ordered(), then this is hard to
track in the engine, because commit() may not yet have been called for a
transaction that needs to become durable.

But instead, you can look at all transactions that have returned from
prepare(). Any transaction that has reached commit_ordered() will first have
done prepare(). Or even just all transactions that have started at all! So
just wait until any transaction that has been prepared has durably committed
(or been durably rolled back). At that point, invoke
commit_checkpoint_notify_ha(). It does not matter if it takes long before
this. Any delay has no worse consequences than having to scan a bit more of
the binlog if we crash.

For example, maybe you can just wait for your next checkpoint to complete, and
invoke commit_checkpoint_notify_ha() at that time, assuming checkpoint makes
transactions durable.
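
The prepare()-based approach described above could be sketched roughly as
follows. This is a minimal illustration, not TokuDB or MariaDB code: the class
CheckpointTracker, its method names, and the notify callback are all
hypothetical. In a real engine the callback would presumably end up invoking
commit_checkpoint_notify_ha() with the cookie that the server passed to
commit_checkpoint_request(). The sketch also assumes that transactions
prepared after the request arrives do not need to be waited for.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <set>
#include <utility>
#include <vector>

// Hypothetical engine-side helper; none of these names exist in MariaDB.
class CheckpointTracker {
public:
  using Notify = std::function<void(void *cookie)>;

  explicit CheckpointTracker(Notify notify) : notify_(std::move(notify)) {
    assert(notify_ && "a notify callback is required");
  }

  // Call when prepare() has returned for transaction trx.
  void on_prepared(std::uint64_t trx) {
    std::lock_guard<std::mutex> guard(mu_);
    in_flight_.insert(trx);
  }

  // Call once the commit (or rollback) of trx is durable on disk.
  void on_durable(std::uint64_t trx) {
    std::vector<void *> ready;
    {
      std::lock_guard<std::mutex> guard(mu_);
      in_flight_.erase(trx);
      for (auto it = pending_.begin(); it != pending_.end();) {
        it->waiting_on.erase(trx);
        if (it->waiting_on.empty()) {
          ready.push_back(it->cookie);
          it = pending_.erase(it);
        } else {
          ++it;
        }
      }
    }
    // Reply outside the lock; a late reply only costs extra binlog scanning.
    for (void *cookie : ready) notify_(cookie);
  }

  // Stand-in for the engine's commit_checkpoint_request() hook.
  void on_checkpoint_request(void *cookie) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      if (!in_flight_.empty()) {
        // Snapshot everything prepared so far; reply once it has all drained.
        pending_.push_back({cookie, in_flight_});
        return;
      }
    }
    notify_(cookie);  // Nothing is prepared, so reply immediately.
  }

private:
  struct Request {
    void *cookie;
    std::set<std::uint64_t> waiting_on;
  };

  std::mutex mu_;
  std::set<std::uint64_t> in_flight_;
  std::vector<Request> pending_;
  Notify notify_;
};
```

An engine that prefers the checkpoint-based variant could skip the
per-transaction bookkeeping, queue the cookies, and reply to all of them when
its next checkpoint completes.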

We do not have to change anything in the MariaDB code for this to work, just
update the comments defining the contract between server and storage
engine. It is just a matter of ensuring that commit_checkpoint_notify_ha() is
only called once every transaction that might have been written to the binlog
before commit_checkpoint_request() was called has been made durable.

Does this sound reasonable?

> MySQL 5.6, we intend to use the flush logs command to do this.

Yes, MySQL 5.6 does not allow new commits to proceed while waiting for the old
binlog to be rotated.

 - Kristian.
