
maria-developers team mailing list archive

Re: Review of patch for MDEV-4820

 

OK, I did some quick testing of the latest 10.0-base. I see a few
points I'm unhappy with at the moment. These are not necessarily
related to MDEV-4820, so I should probably file new bugs for them. I
can do that later if you want me to.

1. When the master doesn't have binlogs and gtid_slave_pos is ahead of
the GTID that the slave tries to connect with, you give the error "The
binlog on the master is missing the GTID ... requested by the slave
(even though both a prior and a subsequent number does exist), and
GTID strict mode is enabled". I find this error message very
confusing: the presence of a subsequent GTID in such a situation is
questionable, and there is certainly no prior GTID in the master's
binlog.

2. The error message "An attempt was made to binlog GTID ... which
would create an out-of-order sequence number with existing GTID ...,
and gtid strict mode is enabled" is confusing too, because it isn't
issued when the slave actually tries to write the event to its binlog.
Apparently the error condition is checked when the slave considers
executing an event that it has just received from the master. If that
event only changes tables matching a replicate-wild-ignore-table
filter, then in non-strict mode it would never be binlogged on the
slave at all. So there's no "attempt to binlog" involved, and the
wording of the error is hard to understand.

3. There's the error message "Specified GTID ... conflicts with the
binary log which contains a more recent GTID .... If
MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override
the new value of @@gtid_slave_pos". It looks like it's issued
inconsistently. My binlog contained an empty Gtid_list, then 0-1-26,
0-1-27, 0-1-28, 0-2-29 and 0-2-30, and both gtid_slave_pos and
gtid_binlog_pos were set to '0-2-30'. In this situation I was able to
set gtid_slave_pos to '0-1-29' successfully and got the "slave has
diverged" error after START SLAVE. Then I was able to set
gtid_slave_pos to '0-2-29' and got the error "Attempt was made to
binlog out-of-order" after START SLAVE.
I'd think that, at least in strict mode, MariaDB shouldn't allow
setting gtid_slave_pos to a value that is clearly in the past.
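
To make "clearly in the past" concrete, here is a rough Python sketch
of the kind of check I have in mind. It is only a model of the idea,
not how the server does (or should) implement it: the names Gtid,
parse_gtid and classify are made up for illustration, and it assumes
a GTID is "in the past" when the binlog already holds a higher
sequence number in the same domain.

```python
from typing import NamedTuple, Sequence


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a 'domain-server-seq' string such as '0-2-30'."""
    domain, server, seq = (int(part) for part in text.split("-"))
    return Gtid(domain, server, seq)


def classify(candidate: str, binlog: Sequence[str]) -> str:
    """Classify a proposed gtid_slave_pos against the GTIDs in the binlog.

    Hypothetical rule (an assumption, not server behaviour):
      'ok'       - nothing newer exists in the candidate's domain
      'stale'    - the GTID is in the binlog, but newer ones follow it
      'diverged' - the binlog has moved past it and never contained it
    """
    cand = parse_gtid(candidate)
    same_domain = [g for g in map(parse_gtid, binlog)
                   if g.domain_id == cand.domain_id]
    if not same_domain:
        return "ok"
    newest = max(same_domain, key=lambda g: g.seq_no)
    if cand.seq_no > newest.seq_no or cand == newest:
        return "ok"
    return "stale" if cand in same_domain else "diverged"


# The binlog contents from the scenario above.
BINLOG = ["0-1-26", "0-1-27", "0-1-28", "0-2-29", "0-2-30"]

for value in ("0-1-29", "0-2-29", "0-2-30"):
    print(value, classify(value, BINLOG))
```

Under this rule both '0-1-29' and '0-2-29' would be rejected already
when gtid_slave_pos is set, instead of failing later with two
different errors at START SLAVE.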

4. Now a real bug. Start three servers S1, S2 and S3 without binlogs.
Set gtid_slave_pos to the same value on all of them. Connect S2 to
replicate from S1. Execute a few transactions on S1. Perform a
failover, making S1 replicate from S2. Now connect S3 to replicate
from S2. At this point S3 should be able to replicate successfully,
because it has the same db state that S2 had in the beginning (S3 has
the same gtid_slave_pos that S2 had initially), and S2 has all the
binlogs needed to move from S3's current position to S2's current
position. Yet S3 gets an error saying that the starting GTID doesn't
exist in S2's binlogs.

I think that to fix this bug we should stop using gtid_slave_pos as an
indication of the current db state. We should make it possible to
change gtid_binlog_pos when there are no events in the binlogs. And
when gtid_binlog_pos is changed we should force a binlog rotation, so
that we have a Gtid_list with the initial value of gtid_binlog_pos.
Then gtid_binlog_pos could always be used for setting the initial db
state, which makes more sense than using gtid_slave_pos. But this will
probably break the detection of slaves trying to connect using a GTID
from before the start of the binlogs...
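
A toy Python model of why the Gtid_list matters here. This is a
deliberate simplification and not the server's actual lookup code:
BinlogFile and master_knows are made-up names, the starting position
'0-1-100' is an arbitrary example value, and the model assumes the
master accepts a starting GTID only if it appears in some Gtid_list
or among the events of its binlog files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BinlogFile:
    # GTIDs recorded in the Gtid_list event at the start of the file.
    gtid_list: List[str]
    # GTIDs of the transactions written to the file.
    events: List[str] = field(default_factory=list)


def master_knows(start_gtid: str, binlog: List[BinlogFile]) -> bool:
    """Simplified check: is the slave's starting GTID visible anywhere?"""
    return any(start_gtid in f.gtid_list or start_gtid in f.events
               for f in binlog)


START = "0-1-100"  # example gtid_slave_pos set on S1, S2 and S3

# Today: S2's first binlog has an empty Gtid_list, followed by the
# transactions it replicated from S1.
s2_today = [BinlogFile(gtid_list=[], events=["0-1-101", "0-1-102"])]

# Proposal: setting gtid_binlog_pos forces a rotation, so the next
# file starts with a Gtid_list holding the initial position.
s2_proposed = [
    BinlogFile(gtid_list=[]),
    BinlogFile(gtid_list=[START], events=["0-1-101", "0-1-102"]),
]

print(master_knows(START, s2_today))     # S3 cannot connect
print(master_knows(START, s2_proposed))  # S3 can connect
```

In the first case S2 has every transaction S3 needs, but nothing in
its binlog records the position they started from, which matches the
error S3 gets.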

5. A GTID-related bug from a completely different area. Take a
database from a previous MySQL version (I tested with a database from
5.1), start MariaDB on it, run mysql_upgrade and then try to set
gtid_slave_pos to something. At this point I got the error "unable to
load slave state from gtid_slave_pos table". Apparently this error was
remembered from MariaDB's startup, and reading the gtid_slave_pos
table wasn't retried after mysql_upgrade actually created it.


Pavel


On Fri, Aug 16, 2013 at 6:27 AM, Kristian Nielsen
<knielsen@xxxxxxxxxxxxxxx> wrote:
> Ok, I've pushed to 10.0-base a patch for MDEV-4820.
>
> revid:knielsen@xxxxxxxxxxxxxxx-20130816131025-etjrvmfvupsjzq83
>
> As far as I can determine (and I checked quite carefully), this fixes all the
> problems you mentioned in the bug description and in your test cases. But I
> could have misunderstood something.
>
> Note that for the problem "For some reason at this point server 1 doesn't have
> any errors and doesn't replicate anything from server 2. Oops", the error is
> caught not when slave connects, but instead when the first event is received,
> which should be just as good. The reason is briefly explained in the changeset
> comment, and is to not re-introduce the bug MDEV-4485.
>
> The error message for "alternate future" I formulated like this:
>
> "Connecting slave requested to start from GTID %u-%u-%llu, which is not in the
> master's binlog. Since the master's binlog contains GTIDs with higher sequence
> numbers, it probably means that the slave has diverged due to executing extra
> errorneous transactions"
>
> I did not want to use the term "alternate future" as this seems to be not
> standard terminology. The MySQL manual uses the related term "diverge".
>
> I am not sure if you will be happy with the fix, but if not, please explain
> clearly if
>
> 1. You observe incorrect behavior (eg. lost transactions, alternate future not
>    caught by error), and if so describe as clearly as possible how to
>    reproduce; or
>
> 2. The behaviour is correct, but you are unhappy about the wording of the
>    error messages, or how the code is implemented.
>
>  - Kristian.
>
> PS. I hope it is clear that I greatly value your feedback. You and Elena are
> the only ones who have seriously worked to help improve the MariaDB GTID, and
> your input has already been very valuable.

