maria-discuss team mailing list archive

Re: MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade

To: Mark Wadham <ubuntu@xxxxxx>
From: Nirbhay Choubey <nirbhay@xxxxxxxxxxx>
Date: Wed, 20 Jul 2016 17:31:22 -0400
Cc: MariaDB discuss <maria-discuss@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <A6747FD7-9015-49D8-8598-94066F3430EF@rkw.io>

Hi Mark,

On Wed, Jul 20, 2016 at 10:38 AM, Mark Wadham <ubuntu@xxxxxx> wrote:

> Hi,
>
> We have a repeatable failure to initiate IST with MariaDB 10.1.14 after
> performing a schema upgrade on a single node in RSU mode.  The error
> condition is when there is a delete query in the format:
>
> delete from <table> where id >= <n>
>
> on the non-RSU cluster nodes while the node is disconnected from the
> cluster.  On rejoining the node determines that it is in sync with the
> other cluster nodes and no IST is performed, despite the rows that were
> deleted in the cluster.  If we then delete the rows manually from the
> joining node, mysqld immediately crashes on the other nodes because they
> can't execute the new write transaction.
>
> The process we followed is:
>
> 1. Set up a 3-node cluster, nodes 0,1,2
> 2. Enable RSU on node 0:
>
> SET GLOBAL wsrep_OSU_method='RSU';
>
> 3. Isolate node 0 from the cluster:
>
> SET GLOBAL wsrep_cluster_address="gcomm://";
>
> 4. Perform a backward-compatible schema change, since this is the point of
> this process.  In our test we added a single column to a table with a
> default value of null.
>

As discussed on IRC (#mariadb), you do not need to take the node off the
cluster (step 3). Just set the session value of wsrep_OSU_method to RSU and
perform the schema change. With RSU enabled, the node automatically desyncs
itself from the cluster before executing any DDL, so the other nodes in the
cluster are not impacted.
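
A minimal sketch of that sequence, run in a single session on the node being
upgraded. The table and column names are placeholders (assumptions, not taken
from your schema), and the final status check is an optional extra:

```sql
-- Switch only this session to RSU; the global setting stays untouched.
SET SESSION wsrep_OSU_method = 'RSU';

-- Backward-compatible schema change (hypothetical table and column).
-- In RSU mode the DDL is applied locally and is not replicated.
ALTER TABLE example_table ADD COLUMN example_col INT NULL DEFAULT NULL;

-- Return the session to the default Total Order Isolation method.
SET SESSION wsrep_OSU_method = 'TOI';

-- Optional: check the node state afterwards.
SHOW STATUS LIKE 'wsrep_local_state_comment';
```

Since the DDL is not replicated in RSU mode, the same statements would then
need to be repeated on each of the other nodes in turn.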

Best,
Nirbhay


>
> Additionally we deleted some rows from a table on nodes 1 and 2, with:
>
> delete from <table> where id >= 100;
>
> which affected around 20 rows.
>
> 5. Rejoin the node to the cluster:
>
> SET GLOBAL wsrep_cluster_address="<gcomm string from config file>";
>
> At this point the node immediately rejoins without doing IST and believes
> it is in sync, yet the rows are deleted on nodes 1 and 2 but not node 0.
>
> Interestingly if the delete query is:
>
> delete from <table> where id = <n>;
>
> there is no problem.  Also we have not had any issue with syncing INSERT
> and UPDATE statements.  A combination of INSERT, UPDATE and
> "DELETE where id >=" resulted in the insert/update statements being
> synced but the deletes not synced.  It is as if the quorum somehow doesn't
> recognise "delete where id >=" as an event.
>
> Our next test cases are:
>
> 1. Switching node 0 back to TOI mode before rejoining the cluster,
> although I can't really see how this would make a difference.
>
> 2. Upgrading to MariaDB 10.1.16 which was released a couple of days ago.
>
> 3. Testing whether regular IST is affected, i.e. IST that should occur
> normally without switching to RSU mode or dropping a node out of the
> cluster.
>
>
> This seems like a pretty basic failure and I'm concerned that it may also
> affect regular IST, i.e. a node falling behind the cluster for normal
> reasons without any involvement of RSU mode, which would effectively make
> the whole system useless if it could randomly drop delete statements.
>
> If anyone can shed any light on why this may be happening we would be very
> grateful!
>
> Thanks,
> Mark

Follow ups

Re: MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade
From: Mark Wadham, 2016-07-21

References

MariaDB 10.1.14 failure to initiate SST after RSU schema upgrade
From: Mark Wadham, 2016-07-20