maria-developers team mailing list archive

Thread
Date

Re: MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Marko Mäkelä <marko.makela@xxxxxxxxxxx>
Date: Fri, 2 Nov 2018 10:45:00 +0200
Cc: "MariaDB Developers \(maria-developers@xxxxxxxxxxxxxxxxxxx\)" <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <874ld0pr1v.fsf@urd.knielsen-hq.org>

Hi Kristian,
Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:
>
> Marko Mäkelä <marko.makela@xxxxxxxxxxx> writes:
>
> > Could you please take a look at this InnoDB regression in MariaDB 10.2
> > and later:
> > https://jira.mariadb.org/browse/MDEV-15740
> >
> > Because of this bug, we can no longer write tests that would ensure
> > that certain changes are present in the redo log, before killing and
>
> But what is the bug here? This is a Galera-only issue, right?

I was under the impression that it affects normal InnoDB too.
But indeed, it looks like you could be right that it is only affecting Galera.

I am sorry for blaming you; I think that the test innodb.innodb_force_recovery
proves that the redo log will be flushed correctly in 10.2 as well as 10.4.
I also tested with a variant that relies on the commit of DML (INSERT)
instead of DDL (DROP TABLE) for flushing the redo log.
I have now reassigned MDEV-15740 as a Galera bug.

> It is not a bug that commit does not flush the log to disk when
> trx->active_commit_ordered, it's an important performance optimisation. The
> transaction is already durable at this point (it was synced in prepare, and
> will be recovered by the transaction coordinator during crash recovery, eg.
> from the binlog).
>
> It's the same in MySQL, just using a different mechanism
> (thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY)).

Thanks, this makes sense now.

> Is the problem just about getting the test cases to work? That's easy to do,
> for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some
> appropriate place.

I prefer to have tests that run on non-debug binaries, and
innodb_flush_log_at_trx_commit=1 should trigger a redo log write
(which will also persist any incomplete transactions from concurrent sessions).

For those cases where we need debug instrumentation, I prefer DEBUG_SYNC.
For example, if I have to kill the server at a specific point, I prefer to do it
externally, like this:

connection other;
SET DEBUG_SYNC = 'some_point SIGNAL hung WAIT_FOR ever';
…
connection default;
SET DEBUG_SYNC = 'now WAIT_FOR hung';
--source include/kill_mysqld.inc
disconnect other;
--source include/start_mysqld.inc

In this way, the test will not cause any memory leak errors in Valgrind or ASAN.

> Or is the problem that Galera somehow disables crash recovery from the
> binlog, and requires an fsync during every commit to be crash safe, not just
> for test-purposes? In that case, isn't the problem the same in MySQL? Then
> it sounds like the problem is that Galera in MySQL does something to avoid
> HA_IGNORE_DURABILITY being set, and this "something" was not merged
> correctly to MariaDB to work within the commit_ordered() framework?
>
> Galera needs to either respect the commit_ordered() semantics (see eg.
> comments in sql/handler.h). Or it needs to take over the transaction
> coordinator role and control commit_ordered() itself. The latter is the
> correct approach (but is not done currently as far as I am aware).
>
> In general, Galera has never been properly integrated with the transaction
> coordinator / commit_ordered() framework in MariaDB, and thus we keep
> getting issues like this popping up. But without knowing how Galera intends
> this to work, it's hard to say what the real problem is...

I have not participated in Galera development, so I cannot say either.

> > -           || thd_requested_durability(trx->mysql_thd)
> > -              == HA_IGNORE_DURABILITY) {
> > +           || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
> >
> > This looks particularly strange because changing
> > innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the
> > condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1
> > is supposed to be more durable and slow compared to 2. Now, with the
> > current code 2 is more durable!
>
> It could be that this could instead use srv_flush_log_at_trx_commit>=1 to
> also avoid an unnecessary log flush (without fsync) to disk. I didn't check
> now, as I think this is not the question at hand here?

Yes, it could be changed to >=1 as part of the Galera fix.

> > Related to this, I discussed an idea with Andrei Elkin: Actually there
> > should be no need to flush the InnoDB redo log at transaction commit if
> > logical logging is enabled. On recovery, we could retrieve the latest
> > committed logical log position from InnoDB, and then replay the logical
> > log from that point of time. We would only have to wait for InnoDB redo
> > log write before discarding the head of the logical log. For this, we
> > could implement some API call like "flush logs up to the LSN corresponding
> > to this binlog position".
>
> It's an interesting idea, I had the same one 8 years ago ;-)
>
>   http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164
>
> (I think there may be a jira corresponding to this MWL#164, but not sure how
> to find it).
>
> It would be really nice to see this implemented. There are a few
> complications though, eg. multi-engine transactions. And it still requires
> that storage engines (/Galera) respect the commit_ordered() semantics.

Yes, I believe that multi-engine transactions would still need a XA
2PC mechanism.
In that case, it seems to me that the binlog would have to continue to
do extra fsync() at all of XA PREPARE, XA COMMIT, XA ROLLBACK.

> If you can describe in more detail what the problem is, I can look a bit
> deeper?

I think that I must carefully debug the issue that I observed in 10.4
to see what exactly is wrong there.
I was making a false assumption, but luckily as a result of this, we
got MDEV-15740 properly recategorized as a Galera bug.

Thanks for the help!

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation

Follow ups

Re: MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)
From: Kristian Nielsen, 2018-11-02

References

MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)
From: Marko Mäkelä, 2018-11-01
Re: MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)
From: Kristian Nielsen, 2018-11-01