
maria-discuss team mailing list archive

Galera Cluster: Cluster Blocked, when one node down?

 

Hello

We use MariaDB Galera Cluster for our email service platform.

We decided to use Galera to create a high availability platform.

After a year of operation, we are starting to realize that Galera
failures seem to be the most common cause of the outages we have had in
the past.

So I wonder whether others operating Galera clusters observe the same
situation:

All our services using DB connections use a DNS round-robin name to
connect to one of our three Galera instances.

While testing this setup, we usually killed one instance or
disconnected the node from the network to simulate an outage. In this
situation, failover works as expected: the clients connect to the two
remaining nodes, and there is no service outage.
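As a rough illustration of what the clients effectively do in this
test (this is not our actual client code; the function names and the
2-second timeout are made up for the example), a client can walk all
addresses behind the round-robin name and skip any that refuse or time
out at the TCP level:

```python
import socket

def resolve_all(name, port=3306):
    """Return every distinct address behind a round-robin DNS name."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in addresses:
            addresses.append(addr)
    return addresses

def first_reachable(addresses, connect=socket.create_connection, timeout=2.0):
    """Return the first address that accepts a TCP connection, else None.

    `connect` is injectable so the logic can be exercised without a network.
    """
    for addr in addresses:
        try:
            conn = connect(addr, timeout=timeout)
        except OSError:
            # Node is down or unreachable: try the next address.
            continue
        conn.close()
        return addr
    return None
```

Note that a check like this only helps when a node is really down or
unreachable. A node that still accepts TCP connections but never
answers queries passes it.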

When the node is restarted, it re-syncs quickly and service with three
nodes is restored.

Now we have experienced a few Galera cluster failures, which seem to
happen this way:
One of the nodes comes under heavy load (DDoS attacks, memory leaks or
similar), which renders the whole physical machine laggy for a short
time. The affected MariaDB node is then thrown out of the cluster by
the two other nodes, probably because it no longer syncs fast enough.

But as the node is not completely 'down', it still accepts connections
from the DB clients, but does not reply to them and seems to remain in a
'db locked' state. Strangely, this then also affects the two remaining
nodes, which also go into 'locked' mode and no longer reply to queries
within the time expected by the application. Of course, this then
causes more DB clients (IMAP, SMTP-Auth, etc.) to spawn and create DB
connections, worsening the whole situation.

The situation can seemingly only be resolved by shutting down the
MariaDB node that got thrown out of the cluster. The situation then
normalizes with the two remaining nodes, and the third one can be
restarted.

Is this expected behaviour? Is there a way to tell a MariaDB node that
got excluded from the cluster to shut itself down completely, so that
it does NOT accept any more connections from clients and block the
whole service?
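To make the question more concrete, here is a minimal sketch of the
kind of check I have in mind for an external watchdog. It assumes the
status values come from SHOW GLOBAL STATUS LIKE 'wsrep_%' on the local
node. The function name and the rule that a timed-out status query
(passed in as None) counts as unhealthy are my own assumptions, and I
do not know whether a node in the 'locked' state would still answer
that query at all:

```python
def node_may_serve(status):
    """Decide whether a Galera node should keep accepting client traffic.

    `status` maps wsrep status variable names to their values, or is None
    if the status query itself failed or timed out (assumed convention).
    """
    if status is None:
        # No answer within the watchdog's timeout: treat as unhealthy.
        return False
    return (
        status.get("wsrep_ready", "").upper() == "ON"
        and status.get("wsrep_connected", "").upper() == "ON"
        # A node outside the primary component reports a non-Primary status.
        and status.get("wsrep_cluster_status", "").lower() == "primary"
        # Deliberately strict: only a fully synced node passes.
        and status.get("wsrep_local_state_comment", "").lower() == "synced"
    )
```

A watchdog could run this every few seconds and stop the MariaDB
service (or firewall the client port) once it returns False several
times in a row.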

Regards

-Benoît Panizzon-
-- 
I m p r o W a r e   A G    -    Leiter Commerce Kunden
______________________________________________________

Zurlindenstrasse 29             Tel  +41 61 826 93 00
CH-4133 Pratteln                Fax  +41 61 826 93 01
Schweiz                         Web  http://www.imp.ch
______________________________________________________

