maria-discuss team mailing list archive
Message #03760
MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade
Hi,
We have a repeatable failure to initiate IST with MariaDB 10.1.14 after
performing a schema upgrade on a single node in RSU mode. The failure
occurs when a delete query of the form:
delete from <table> where id >= <n>
is run on the non-RSU cluster nodes while the RSU node is disconnected
from the cluster. On rejoining, the node determines that it is in sync
with the other cluster nodes and no IST is performed, even though rows
were deleted in the cluster in the meantime. If we then delete the rows manually from the
joining node, mysqld immediately crashes on the other nodes because they
can't execute the new write transaction.
The process we followed is:
1. Set up a 3-node cluster, nodes 0,1,2
2. Enable RSU on node 0:
SET GLOBAL wsrep_OSU_method='RSU';
3. Isolate node 0 from the cluster:
SET GLOBAL wsrep_cluster_address="gcomm://";
4. On node 0, perform a backward-compatible schema change, since this is
the point of this process. In our test we added a single column with a
default value of NULL to a table.
While node 0 was isolated, we also deleted some rows from a table on
nodes 1 and 2 with:
delete from <table> where id >= 100;
which affected around 20 rows.
5. Rejoin the node to the cluster:
SET GLOBAL wsrep_cluster_address="<gcomm string from config file>";
At this point the node immediately rejoins without doing IST and
believes it is in sync, yet the rows are deleted on nodes 1 and 2 but
not node 0.
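One way to confirm the divergence after step 5 is to compare the Galera status variables and the row count on every node. This is only a sketch: `mytable` is a placeholder for the affected table, and it assumes no new rows with id >= 100 were inserted in the meantime.

```sql
-- Run on each node (0, 1 and 2) in turn and compare the results.
-- `mytable` is a placeholder for the table the rows were deleted from.

-- Node state as reported by Galera.
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

-- Cluster state UUID and the last committed sequence number.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid';
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';

-- Rows that the DELETE in step 4 should have removed.
SELECT COUNT(*) FROM mytable WHERE id >= 100;
```

In the failure described here, node 0 would report itself as synced while returning a non-zero count, and nodes 1 and 2 would return 0.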
Interestingly if the delete query is:
delete from <table> where id = <n>;
there is no problem. We have also not had any issue with syncing INSERT
and UPDATE statements. When we ran a mix of INSERT, UPDATE and
delete where id >= statements, the inserts and updates were synced but
the deletes were not. It is as if the quorum somehow doesn't recognise
delete where id >= as an event.
Our next test cases are:
1. Switching node 0 back to TOI mode before rejoining the cluster,
although I can't really see how this would make a difference.
2. Upgrading to MariaDB 10.1.16 which was released a couple of days ago.
3. Testing whether regular IST is affected, i.e. IST that should occur
normally without switching to RSU mode or dropping a node out of the
cluster.
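For test case 1, a minimal sketch of the extra step, run on node 0 after the schema change and before step 5. Whether it changes the outcome is untested; it only shows where the switch would go.

```sql
-- On node 0, after the schema change and before rejoining:
SET GLOBAL wsrep_OSU_method='TOI';

-- Confirm the mode actually changed before rejoining.
SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';

-- Then rejoin as in step 5, using the gcomm string from the config file.
```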
This seems like a pretty basic failure and I'm concerned that it may
also affect regular IST, i.e. a node falling behind the cluster for
normal reasons without any involvement of RSU mode, which would
effectively make the whole system useless if it could randomly drop
delete statements.
If anyone can shed any light on why this may be happening we would be
very grateful!
Thanks,
Mark