maria-developers team mailing list archive

Re: Semisync plugin incompatibility

To: Pavel Ivanov <pivanof@xxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 11 Nov 2013 11:29:30 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CAAG=WUvsqNV2LV3D2F4H+eFn6nNq9-rsyN8Nc-Hcac6VxunExg@mail.gmail.com> (Pavel Ivanov's message of "Sat, 9 Nov 2013 17:00:49 -0800")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
Pavel Ivanov <pivanof@xxxxxxxxxx> writes:

> We've noticed recently that semisync_master plugin in MariaDB (which
> apparently was fully inherited from MySQL) is seriously incompatible
> with our understanding of the purpose of semi-sync replication. This
> incompatibility was apparently introduced as a fix for
> http://bugs.mysql.com/bug.php?id=45672. The "major no-no" that bug

So as I understand it, this bug is about what should happen when semisync is
enabled, but no slaves are connected.

Apparently before the fix of Bug#45672, an error was thrown late during
COMMIT. So the transaction was committed (locally on the master), but the
client still got an error back.

And if I understand correctly, after the fix of Bug#45672, no error is thrown
in the case where no slave is connected.

> talks about is in our opinion the whole purpose of semi-sync
> replication -- if transaction is not replicated to at least one slave
> client shouldn't get OK even if transaction is committed locally on
> the master. Also master shouldn't just turn off semi-sync replication
> whenever it wants.

So with "just turn off semi-sync replication whenever it wants" - what are you
referring to here? I seem to remember that semisync has a timeout, and it gets
disabled if that timeout triggers? My guess is that this is what you have in
mind, but I wanted to ask to make sure ...

> We will fix this problem for us, but first I wanted to understand
> what's your view of the purpose of semi-sync replication and how you
> think it should work? I need to know your opinion to understand how I
> should fix this issue...

Well, personally, I never was much interested in semi-sync. But it is my
understanding that there is some interest, so I will answer with what small
opinion I have.

I suppose the general idea is that when the client sees its COMMIT complete, it
can know that its transaction exists in at least two places (master binlog +
at least one slave relay log). So there is no longer any single point of
failure that can cause loss of the transaction.
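
To make that guarantee concrete, here is a minimal toy model of the wait, not
the plugin's actual code. The class ToyMaster, its ack queue and the
ack_timeout parameter are all hypothetical names made up for illustration;
the only thing taken from the description above is the ordering (write the
master binlog first, then wait for at least one slave to acknowledge).

```python
import queue


class ToyMaster:
    """Toy model of a semi-sync master; hypothetical, not the real plugin."""

    def __init__(self, ack_timeout=0.05):
        self.binlog = []            # events written on the master
        self.acks = queue.Queue()   # acks arriving from slaves (assumed channel)
        self.ack_timeout = ack_timeout

    def commit(self, event):
        # Step 1: the event goes into the master binlog first.
        self.binlog.append(event)
        # Step 2: wait until some slave reports the event in its relay log.
        try:
            while self.acks.get(timeout=self.ack_timeout) != event:
                pass  # skip acks that belong to earlier events
            return True   # the event now exists in at least two places
        except queue.Empty:
            return False  # no ack arrived: only the master has the event
```

Note that when commit() returns False the event is still in the binlog, which
is exactly the awkward situation discussed further down.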

Another point of view is that semi-sync provides some sort of throttle on
how fast the master can generate events compared to how fast the slaves can
receive them:

    http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/#comment-878447

There was also a suggestion (and a patch is floating around somewhere) for
"enhanced semisync replication":

    https://mariadb.atlassian.net/browse/MDEV-162

This delays not only client acknowledge but also InnoDB commit until the ack
from at least one slave, which means that transactions are not visible to
other clients until they exist on at least one slave in addition to on the
master.
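
As a rough illustration of the difference in ordering, and assuming I have
understood the proposal correctly, the two variants can be written down as
lists of steps. The function name and the step labels are made up for this
sketch and do not correspond to anything in the server source.

```python
def commit_steps(enhanced):
    """Return the assumed order of commit steps for the two variants."""
    steps = ["write binlog"]
    if enhanced:
        # Proposed variant: hold back the engine commit until a slave acks.
        steps += ["wait for slave ack", "engine commit (visible to others)"]
    else:
        # Plain semi-sync: the engine commit happens before the ack wait.
        steps += ["engine commit (visible to others)", "wait for slave ack"]
    steps.append("reply to client")
    return steps
```

In both variants the client reply comes last; only the point at which other
clients can see the transaction moves.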

Since this is _semi_-sync, not real two-phase commit synchronous replication,
the main problem is that there is no way to ensure consistency in the general
error case. The transaction is already fully committed on the master, it
cannot be rolled back. So we are left with the choice of one of two evils:

1. Report an error to the client. Most clients would then probably wrongly
assume that the transaction was _not_ committed. There also does not seem to
be much the client can do about the error except perhaps log an incident to
the monitoring system. On the other hand, then at least the problem is not
silently ignored.

2. Report success to the client but complain loudly in the error log (I assume
this is what happens in current code). This leaves the client unaware that
there is a problem (but presumably the monitoring system will catch the
message in the error log).
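
The two choices can be sketched as a single decision taken after the local
commit is already done. This is only an illustration under my assumptions:
finish_commit, the policy strings and NotReplicatedError are hypothetical
names, not server APIs.

```python
import logging

log = logging.getLogger("semisync_sketch")


class NotReplicatedError(Exception):
    """Hypothetical error sent to the client after a local-only commit."""


def finish_commit(acked, policy):
    # The local commit has already happened when we get here;
    # neither branch below can undo it.
    if acked:
        return "OK"
    if policy == "error":
        # Choice 1: tell the client, at the risk that it assumes a rollback.
        raise NotReplicatedError("committed locally, no slave ack")
    # Choice 2: report success, leave the complaint for the error log.
    log.error("semi-sync: commit not acknowledged by any slave")
    return "OK"
```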

From this summary, I think I can see the logic of the current behaviour:

 - It preserves protection against single-point-of-failure. If all slaves are
   gone, then we already have one failure, and unless we experience a double
   failure (master also failing before slave recovers), the transaction will
   eventually be sent to a slave and no overall failure happens.

 - If the client cannot do anything about the problem anyway except notify
   the monitoring system, the server may as well do the notification itself.

But the opposite point of view also has merit. The client asked for semi-sync
behavior, but did not get it, and it does not even have a way to know about
the problem. That is not good.

Does the client currently at least get a warning for the COMMIT? I think it
should (e.g. the fix for Bug#45672 should at least have been to turn the error
into a warning, not remove the error completely).

What I think could make sense is if the client got an error during the prepare
phase if no slaves are connected. In this case we _can_ roll back the
transaction and give an error to the client without any issue of
consistency. But it still leaves a small window where the last slave can
disappear between the prepare and the commit phase and leave us with the
original problem.
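
A small sketch of that idea, including the remaining window, might look like
this. ToyServer, NoSlaveError and the slave set are hypothetical and only
model the two phases; the real server obviously does much more in each.

```python
class NoSlaveError(Exception):
    """Hypothetical error raised while rollback is still possible."""


class ToyServer:
    def __init__(self, slaves):
        self.slaves = set(slaves)   # currently connected slaves (assumed)
        self.committed = []

    def prepare(self, trx):
        # Nothing is committed yet, so failing here is safe:
        # the transaction can simply be rolled back.
        if not self.slaves:
            raise NoSlaveError("no semi-sync slave connected")
        return trx

    def commit(self, trx):
        # Past this point the transaction cannot be rolled back.
        self.committed.append(trx)
        # True if a slave is still connected and so could send an ack.
        return bool(self.slaves)
```

The window shows up if the last slave disconnects after prepare() succeeded:
commit() then returns False even though the check passed earlier.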

I hope this helps you ... Maybe you can describe your use-case, and how you
need things to work for that case? Personally I have nothing against changing
this behaviour to something more logical, I am just not sure what the most
logical behaviour is ...

 - Kristian.
Follow ups

Re: Semisync plugin incompatibility
From: Pavel Ivanov, 2013-11-11