maria-discuss team mailing list archive
Re: Queries Not Completing After Backup
Do you run wsrep_desync=ON on the node before running the backup? It seems
like a case of flow control triggering.
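
One way to check the flow control theory is to look at the wsrep status counters on each node while the problem is happening. Below is a minimal sketch, assuming you have already collected the output of SHOW GLOBAL STATUS LIKE 'wsrep_%' into a Python dict of name/value strings. The function name flow_control_report and the 0.1 threshold are hypothetical and only for illustration; the status variable names are the standard Galera ones.

```python
def flow_control_report(status, paused_threshold=0.1):
    """Summarise flow-control indicators from wsrep status values.

    `status` maps status variable names to their values, e.g. the rows
    of SHOW GLOBAL STATUS LIKE 'wsrep_%'. The threshold is an arbitrary
    illustrative value, not a recommended setting.
    """
    # Fraction of time replication was paused since the last FLUSH STATUS.
    paused = float(status.get("wsrep_flow_control_paused", 0.0))
    # Number of flow-control pause messages this node has sent.
    sent = int(status.get("wsrep_flow_control_sent", 0))
    # Current length of this node's receive queue.
    recv_queue = int(status.get("wsrep_local_recv_queue", 0))
    state = status.get("wsrep_local_state_comment", "unknown")
    return {
        "state": state,
        "paused_fraction": paused,
        "fc_messages_sent": sent,
        "recv_queue": recv_queue,
        # High paused fraction suggests writes are being held back.
        "likely_flow_control": paused >= paused_threshold,
        # A node that sends pause messages with a backlog is the slow one.
        "node_is_throttling_cluster": sent > 0 and recv_queue > 0,
    }


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the reported incident.
    sample = {
        "wsrep_flow_control_paused": "0.85",
        "wsrep_flow_control_sent": "12",
        "wsrep_local_recv_queue": "140",
        "wsrep_local_state_comment": "Synced",
    }
    print(flow_control_report(sample))
```

If the node being backed up reports "Synced" rather than "Donor/Desynced" in wsrep_local_state_comment while sending flow-control messages, the desync step is probably not taking effect.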
On Fri, Dec 11, 2015 at 1:29 AM Brad Jorgensen <brad@xxxxxxxxxxxxxx> wrote:
> We have a three node (db1, db2, db3) galera cluster with MariaDB 10.0.22
> on CentOS 6.7. A couple days ago I upgraded to 10.1.9. Xtrabackup
> (2.3.2) is run every night on each node at 1am, 2am, and 3am
> respectively. Before the backup starts, the node is desynced.
> The first night after upgrading to 10.1.9, the problem began. All
> connections were going to db1 until the backup started when db1 was
> removed from the routing pool and new connections began going to db2.
> At that time there was little traffic aside from the backup; much of it
> is probably monitoring queries. Our monitoring shows that running
> threads went from about 2 just before the backup finished around 1:32am
> to about 150 just after. At the same time, the running threads on db2
> went from 1 to 10. After the backup completed, all new connections were
> going to db1 again. The running threads on db1 continued to slowly grow
> until the stuck queries took up all of the server processes on
> our application servers and we were alerted around 3:50am. I checked
> the process list and almost all of the queries were in the "query end"
> state and I think they were all write queries. I tried to kill most of
> them but they just stayed in the same state. I restarted db2 to try to
> kick the cluster without losing data. I had to force the shutdown since
> three threads never ended after about 10 minutes of waiting. The
> running threads on db1 returned to normal. db2 had to do a full SST
> which took until 6:05 to complete. At that time, the running processes
> on db1 began to increase again. When db2 was back up I downgraded to
> 10.0.22 and rejoined it to the cluster. I tried to restart db1, but it
> needed a full SST so I left it down. A bit later I took down db3 to
> downgrade it, too, and that went fine. The cluster was fine through the
> day during normal business operation.
> The next night only db2 and db3 were up and were running 10.0.22. What
> appears to be the same problem started at 3:31am, when xtrabackup paused
> galera ("Provider paused at
> 8c53b634-9514-11e4-b8bd-dab05673fb36:875650526") on db3 for the backup.
> At that time the running threads on db2 shot up and slowly increased
> until I shut it down at 6:28. I had to kill it again due to three
> threads not ending. db3 showed nothing unusual in the logs. I got the
> innodb engine status from db2 three times a few minutes apart before I
> restarted; they are attached.
> Additionally, I attached an excerpt from the logs on db2 and db3 during
> the second incident and the my.cnf from one of the servers; it's
> basically the same for the others. I'm working on getting a clean set
> of logs from the first incident, but from what I initially saw, they are
> basically the same as the second set of logs. I'm ready if the problem
> arises again and I'll try to get more information including SHOW GLOBAL
> Our environment hasn't changed for at least a month and the issue first
> appeared after upgrading to 10.1.9, but since it didn't go away after
> downgrading, I'm not sure where the issue is.
> I found a few mentions of what might be the same problem:
Remote DBA Services Manager