← Back to team overview

maria-developers team mailing list archive

Re: Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

 

Marko Mäkelä <marko.makela@xxxxxxxxxxx> writes:

> Do you know how MyRocks implements ACID? There is a function

Sorry, I do not. Maybe Sergey Petrunya does.

> That function does not currently take any parameter (such as THD) to
> identify the transaction of interest, and it cannot indicate that the
> most recent state change of the transaction has already been durably
> written.

The transaction of interest is the last one that called commit_ordered()
prior to this call of commit_checkpoint_request(). I guess binlog code could
save it (it's trivial as commit_ordered() calls are serialised), or InnoDB
could do it itself.

But let's take it from what we actually want to achive. We have created a
new binlog file, and now crash recovery will have to scan the two most
recent binlog files. We want to know when scanning the old binlog file is no
longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
But we don't want to stall waiting for everything to be flushed to disk
immediately; we can just log the CHECKPOINT_EVENT later when appropriate.

We call commit_checkpoint_request() into all storage engines to receive a
notification when all currently (binlog) committed transactions have become
durable. Once the engine has synced past its corresponding LSN, it replies
with commit_checkpoint_notify_ha(), and when all engines have done so, the
CHECKPOINT_EVENT is binlogged.

In 10.4 innobase_checkpoint_request() the code saves the current LSN:

    lsn = log_get_lsn();

An extra check is made if already lsn <= log_get_flush_lsn(); if so,
commit_checkpoint_notify_ha() is called immediately. Else, we check again
whenever the log is flushed, until the recorded lsn has been flushed.

The code in 11.1 is different, though it seems to be doing something
similar.

There seems to be some lock-free operations, I wonder if that makes sense,
this is not timing-critical code (it's called once per binlog rotation).

There's also some complex logic to avoid missing a
commit_checkpoint_notify_ha(). I can see how that is important and could
occur in a completely idle server. But I think it can be done simpler, by
simply scheduling a new check of this every second or so as long as there
are pending checkpoint requests. This is all asynchronous, no point in
trying to report the checkpoint immediately.

If InnoDB is running non-durably, I think there is no point in checking the
log at all. It doesn't help that we scan the required log files if the
transactions to be recoved from InnoDB were not prepared durably and are
unrecoverable anyway. So in non-durable mode InnoDB could just call
commit_checkpoint_notify_ha() immediately?

I notice that RESET MASTER also does a commit_checkpoint_request() and waits
for the notification. I can understand why I did it this way, but in
hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so
we should not wait on it. RESET MASTER surely cannot be time critical, so
probably it would be better to just ask InnoDB to flush everything to its
log, skipping the checkpoint mechanism. Or just do nothing, does it make any
sense to recover the binlog into a consistent state with InnoDB when we just
deleted all of the binlog?!?

Anyway, these are comments on the current code. I still didn't fully
understand what the problem is with the current API wrt. InnoDB?

But I don't see a big problem with changing it either if InnoDB needs
something different. All that is needed from the binlog side is to get a
notification at some point that an old binlog file is no longer needed for
recovery. If we can find a simpler way to do this, then that's good.
Removing the checkpointing from RESET MASTER completely might be a good
start.

> Where should a notification be initiated if all changes have already
> been written at the time handlerton::commit_checkpoint_request is
> called?

Then commit_checkpoint_notify_ha() could be called immediately before
returning from commit_checkpoint_request().

>> Since the binlog writing is essentially sequential in nature, it is more
>> efficient to do it in a single thread for all waiting threads, and this is
>> how the MariaDB binlog group commit is done.
>
> Yes, optimizing that would require a file format change, to replace
> the append-only-to-last-file with something else, such as one or
> multiple preallocated files, possibly arranged as a ring buffer.

Agree, pre-allocated would be more efficient.

> One related idea that has been floating around is to use the InnoDB
> log as a "doublewrite buffer" for short enough binlog event groups.
> The maximum mini-transaction size (dictated by the InnoDB recovery
> code) is something like 4 megabytes. On an InnoDB log checkpoint, any
> buffered binlog event groups would have to be durably written to the
> binlog. Likewise, if a binlog event group is too large to be buffered,
> it would have to be written and fdatasync()ed in advance.

Ah, yes, this is an interesting idea actually.

>> Yes, it is an interesting idea. What do you see as the mechanism to recover
>> a transaction inside InnoDB in case of a crash just after writing to the
>> binlog?
>
> It is simple but potentially costly: Those transactions that were
> recovered in some other state than COMMIT or XA PREPARE will be rolled
> back. Then everything will be replayed from the binlog. InnoDB does

I have an old worklog where this idea was explored in some detail. It is not
trivial, there are many corner cases where replaying binlog events will not
reliably recover the state (eg. statement-based binlogging). People have
learned to live with this for replication, but now that will be extended to
the binlog. One "interesting" corner case is a multi-engine transaction that
ended up durable in one engine but needs replay in another engine. Maybe all
this is tolerable and we can require row-mode binlogging or something, the
gain will be huge for some workloads.

> store the latest committed binlog file name and offset. I do not know
> where exactly that information is being used.

(I think the information with binlog file name and offset is used to
provision a new slave from an innobackup or LVM/BTRFS snapshot?).

 - Kristian.


Follow ups

References