← Back to team overview

maria-developers team mailing list archive

Re: MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)

 

Marko Mäkelä <marko.makela@xxxxxxxxxxx> writes:

> Could you please take a look at this InnoDB regression in MariaDB 10.2
> and later:
> https://jira.mariadb.org/browse/MDEV-15740
>
> Because of this bug, we can no longer write tests that would ensure
> that certain changes are present in the redo log, before killing and

But what is the bug here? This is a Galera-only issue, right?

It is not a bug that commit does not flush the log to disk when
trx->active_commit_ordered, it's an important performance optimisation. The
transaction is already durable at this point (it was synced in prepare, and
will be recovered by the transaction coordinator during crash recovery, eg.
from the binlog).

It's the same in MySQL, just using a different mechanism
(thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY)).

Is the problem just about getting the test cases to work? That's easy to do,
for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some
appropriate place.

Or is the problem that Galera somehow disables crash recovery from the
binlog, and requires an fsync during every commit to be crash safe, not just
for test-purposes? In that case, isn't the problem the same in MySQL? Then
it sounds like the problem is that Galera in MySQL does something to avoid
HA_IGNORE_DURABILITY being set, and this "something" was not merged
correctly to MariaDB to work within the commit_ordered() framework?

Galera needs to either respect the commit_ordered() semantics (see eg.
comments in sql/handler.h). Or it needs to take over the transaction
coordinator role and control commit_ordered() itself. The latter is the
correct approach (but is not done currently as far as I am aware).

In general, Galera has never been properly integrated with the transaction
coordinator / commit_ordered() framework in MariaDB, and thus we keep
getting issues like this popping up. But without knowing how Galera intends
this to work, it's hard to say what the real problem is...

> -           || thd_requested_durability(trx->mysql_thd)
> -              == HA_IGNORE_DURABILITY) {
> +           || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
>
> This looks particularly strange because changing
> innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the
> condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1
> is supposed to be more durable and slow compared to 2. Now, with the
> current code 2 is more durable!

It could be that this could instead use srv_flush_log_at_trx_commit>=1 to
also avoid an unnecessary log flush (without fsync) to disk. I didn't check
now, as I think this is not the question at hand here?

> Related to this, I discussed an idea with Andrei Elkin: Actually there
> should be no need to flush the InnoDB redo log at transaction commit if
> logical logging is enabled. On recovery, we could retrieve the latest
> committed logical log position from InnoDB, and then replay the logical
> log from that point of time. We would only have to wait for InnoDB redo
> log write before discarding the head of the logical log. For this, we
> could implement some API call like "flush logs up to the LSN corresponding
> to this binlog position".

It's an interesting idea, I had the same one 8 years ago ;-)

  http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164

(I think there may be a jira corresponding to this MWL#164, but not sure how
to find it).

It would be really nice to see this implemented. There are a few
complications though, eg. multi-engine transactions. And it still requires
that storage engines (/Galera) respect the commit_ordered() semantics.

If you can describe in more detail what the problem is, I can look a bit
deeper?

Hope this helps,

 - Kristian.


Follow ups

References