maria-discuss team mailing list archive

Re: Queries Not Completing After Backup

 

Hi Brad,

Do you set wsrep_desync=ON on the node before running the backup? This looks
like a case of flow control being triggered.
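
One way to confirm this is to look at the wsrep status counters on the node
being backed up and on the node taking writes. Below is a minimal sketch in
Python, assuming you have already collected the rows of
SHOW GLOBAL STATUS LIKE 'wsrep_%' into a dict. The function name
diagnose_flow_control and both thresholds are hypothetical and chosen only for
illustration; they are not Galera defaults.

```python
def diagnose_flow_control(status, paused_threshold=0.1, queue_threshold=100):
    """Return a list of findings from a dict of wsrep status values.

    `status` maps status variable names to their values as returned by
    SHOW GLOBAL STATUS LIKE 'wsrep_%' (strings are accepted).
    Both thresholds are illustrative assumptions, not Galera defaults.
    """
    findings = []

    # Fraction of time since the last FLUSH STATUS that replication
    # was paused because of flow control.
    paused = float(status.get("wsrep_flow_control_paused", 0.0))
    # Number of flow control pause messages this node has sent.
    sent = int(status.get("wsrep_flow_control_sent", 0))
    # Current length of the local receive (apply) queue.
    recv_queue = int(status.get("wsrep_local_recv_queue", 0))
    # Human-readable node state, e.g. "Synced" or "Donor/Desynced".
    state = status.get("wsrep_local_state_comment", "")

    if paused > paused_threshold:
        findings.append(
            "replication was paused by flow control %.0f%% of the time"
            % (paused * 100)
        )

    if sent > 0 and recv_queue > queue_threshold and state != "Donor/Desynced":
        findings.append(
            "node is sending flow control with a long receive queue "
            "while not desynced (state: %s)" % state
        )

    return findings
```

If the first finding shows up on the node taking writes and the second on the
node running the backup, that would be consistent with the backup node
throttling the cluster rather than being desynced as intended.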

On Fri, Dec 11, 2015 at 1:29 AM Brad Jorgensen <brad@xxxxxxxxxxxxxx> wrote:

> We have a three node (db1, db2, db3) galera cluster with MariaDB 10.0.22
> on CentOS 6.7.  A couple days ago I upgraded to 10.1.9.  Xtrabackup
> (2.3.2) is run every night on each node at 1am, 2am, and 3am
> respectively.  Before the backup starts, the node is desynced.
>
> The first night after upgrading to 10.1.9, the problem began.  All
> connections were going to db1 until the backup started when db1 was
> removed from the routing pool and new connections began going to db2.
> At that time there was little traffic aside from the backup; much of it
> was probably monitoring queries.  Our monitoring shows that running
> threads went from about 2 just before the backup finished around 1:32am
> to about 150 just after.  At the same time, the running threads on db2
> went from 1 to 10.  After the backup completed, all new connections were
> going to db1 again.  The running threads on db1 continued to slowly grow
> until the stuck queries took up all of the server processes on
> our application servers and we were alerted around 3:50am.  I checked
> the process list and almost all of the queries were in the "query end"
> state and I think they were all write queries.  I tried to kill most of
> them but they just stayed in the same state.  I restarted db2 to try to
> kick the cluster without losing data.  I had to force the shutdown since
> three threads never ended after about 10 minutes of waiting.  The
> running threads on db1 returned to normal.  db2 had to do a full SST
> which took until 6:05 to complete.  At that time, the running processes
> on db1 began to increase again.  When db2 was back up I downgraded to
> 10.0.22 and rejoined it to the cluster.  I tried to restart db1, but it
> needed a full SST so I left it down.  A bit later I took down db3 to
> downgrade it too, and that went fine.  The cluster was fine through the
> day during normal business operation.
>
> The next night only db2 and db3 were up and were running 10.0.22.  What
> appears to be the same problem started at 3:31am, when xtrabackup paused
> galera ("Provider paused at
> 8c53b634-9514-11e4-b8bd-dab05673fb36:875650526") on db3 for the backup.
> At that time the running threads on db2 shot up and slowly increased
> until I shut it down at 6:28.  I had to kill it again due to three
> threads not ending.  db3 showed nothing unusual in the logs.  I got the
> innodb engine status from db2 three times a few minutes apart before I
> restarted; they are attached.
>
> Additionally, I attached an excerpt from the logs on db2 and db3 during
> the second incident and the my.cnf from one of the servers; it's
> basically the same for the others.  I'm working on getting a clean set
> of logs from the first incident, but from what I initially saw, they are
> basically the same as the second set of logs.  I'm ready if the problem
> arises again and I'll try to get more information including SHOW GLOBAL
> STATUS.
>
> Our environment hasn't changed for at least a month and the issue first
> appeared after upgrading to 10.1.9, but since it didn't go away after
> downgrading, I'm not sure where the issue is.
>
> I found a few mentions of what might be the same problem:
> http://marialog.archivist.info/2015-04-03.txt
> https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1149755
>
-- 
Guillaume Lefranc
Remote DBA Services Manager
MariaDB Corporation