maria-discuss team mailing list archive

MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade

To: maria-discuss@xxxxxxxxxxxxxxxxxxx
From: "Mark Wadham" <ubuntu@xxxxxx>
Date: Wed, 20 Jul 2016 15:38:25 +0100

Hi,

We have a repeatable failure to initiate IST with MariaDB 10.1.14 after performing a schema upgrade on a single node in RSU mode. The error condition is when there is a delete query in the format:


delete from <table> where id >= <n>

on the non-RSU cluster nodes while the node is disconnected from the cluster. On rejoining, the node determines that it is in sync with the other cluster nodes and no IST is performed, despite the rows that were deleted in the cluster. If we then delete the rows manually from the joining node, mysqld immediately crashes on the other nodes because they can't execute the new write transaction.


The process we followed is:

1. Set up a 3-node cluster, nodes 0,1,2
2. Enable RSU on node 0:

SET GLOBAL wsrep_OSU_method='RSU';

3. Isolate node 0 from the cluster:

SET GLOBAL wsrep_cluster_address="gcomm://";

4. Perform a backward-compatible schema change, since this is the point of this process. In our test we added a single column to a table with a default value of null.


Additionally we deleted some rows from a table on nodes 1 and 2, with:

delete from <table> where id >= 100;

which affected around 20 rows.

5. Rejoin the node to the cluster:

SET GLOBAL wsrep_cluster_address="<gcomm string from config file>";

At this point the node immediately rejoins without doing IST and believes it is in sync, yet the rows are deleted on nodes 1 and 2 but not on node 0.
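
A minimal sketch of checks that could be run on each node after step 5 to confirm the mismatch. The status variables are the standard Galera ones; <table> and the id >= 100 predicate are the placeholders from the test above, and the choice of which variables to compare is an assumption about what is most useful here.

```sql
-- Run on each of nodes 0, 1 and 2 and compare the results across nodes.

-- Cluster state UUID and last committed sequence number: a joiner that
-- reports the same values as the other nodes is treated as in sync.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid';
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';

-- Node state as the node itself sees it (e.g. Synced) and cluster size.
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';

-- Rows that should have been removed by the delete on nodes 1 and 2.
SELECT COUNT(*) FROM <table> WHERE id >= 100;
```

If node 0 reports the same wsrep_cluster_state_uuid and wsrep_last_committed as nodes 1 and 2 while the row count differs, that would match the behaviour described above.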


Interestingly if the delete query is:

delete from <table> where id = <n>;

there is no problem. Also, we have not had any issue with syncing INSERT and UPDATE statements. A combination of INSERT, UPDATE and DELETE where id >= resulted in the insert/update statements being synced but the deletes not synced. It is as if the quorum somehow doesn't recognise delete where id >= as an event.


Our next test cases are:

1. Switching node 0 back to TOI mode before rejoining the cluster, although I can't really see how this would make a difference.


2. Upgrading to MariaDB 10.1.16 which was released a couple of days ago.

3. Testing whether regular IST is affected, i.e. IST that should occur normally without switching to RSU mode or dropping a node out of the cluster.
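
As a sketch, test case 1 would amount to one extra statement on node 0 between steps 4 and 5, assuming nothing else needs to be reset before rejoining:

```sql
-- On node 0, after the schema change and before rejoining the cluster:
-- switch schema upgrades back to Total Order Isolation.
SET GLOBAL wsrep_OSU_method='TOI';

-- Confirm the setting took effect.
SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';
```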

This seems like a pretty basic failure and I'm concerned that it may also affect regular IST, i.e. a node falling behind the cluster for normal reasons without any involvement of RSU mode, which would effectively make the whole system useless if it could randomly drop delete statements.

If anyone can shed any light on why this may be happening we would be very grateful!


Thanks,
Mark

Follow ups

Re: MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade
From: Nirbhay Choubey, 2016-07-20
Re: MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade
From: Mark Wadham, 2016-07-20