We use MariaDB Galera Cluster for our email service platform.

We decided to use Galera to create a high availability platform.

After a year of operation, we are starting to realize that Galera
failures seem to be the most common cause of the outages we have had in the

So I wonder whether others operating Galera clusters also observe this

All our services that use DB connections connect through a DNS
round-robin name to one of our three Galera instances.

While testing this setup, we usually killed one instance or
disconnected the node from the network to simulate an outage. In this
situation, everything works as expected: the clients connect to the two
remaining nodes and there is no service outage.
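
To illustrate the client-side behaviour we rely on in this test, here is a minimal sketch in Python. It is not what our services actually run (they use their own DB libraries); the function name `connect_to_any` and the hostname in the usage note are made up for illustration. It resolves the round-robin name and tries each address in turn with a short connect timeout.

```python
import socket


def connect_to_any(hostname, port=3306, timeout=2.0,
                   connect=socket.create_connection,
                   resolve=socket.getaddrinfo):
    """Try every address behind a round-robin DNS name.

    Returns the first connection that succeeds. `connect` and `resolve`
    are injectable only so the logic can be exercised without a network.
    """
    infos = resolve(hostname, port, type=socket.SOCK_STREAM)
    last_error = None
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        try:
            # sockaddr[:2] is (host, port) for both IPv4 and IPv6 entries
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            # Node is down or unreachable: move on to the next address
            last_error = exc
    raise ConnectionError(f"no node reachable behind {hostname}") from last_error
```

Usage would be something like `connect_to_any("db.example.internal")`. Note that this only skips nodes that refuse or time out on the TCP connect; a node that accepts the connection and then never answers a query is not detected by a check like this.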

When the node is restarted, it is re-synced quickly and service
with three nodes is restored.

Now we have experienced a few Galera cluster failures, which seem to
happen this way:
One of the nodes comes under heavy load (DDoS attacks, memory leaks or
similar), which renders the whole physical machine laggy for a
short time. The affected MariaDB node is then thrown out of the
cluster by the two other nodes, probably for not syncing fast enough.

But as the node is not completely 'down', it still accepts connections
from the DB clients. It does not reply to them, though, and seems to
remain in a 'db locked' state. Strangely, this then also affects the two
remaining nodes, which also go into 'locked' mode and no longer reply to
queries within the time the application expects. Of course, this
then causes more DB clients (IMAP, SMTP-Auth, etc.) to spawn and to
create DB connections, worsening the whole situation.

The situation can seemingly only be resolved by shutting down the
MariaDB node that got thrown out of the cluster. Then the situation
normalizes with the two remaining nodes and the third one can be
Is this expected behaviour? Is there a way to tell a MariaDB node that
got excluded from the cluster to shut itself down completely, so that it
does NOT accept any more connections from clients and block the
whole service?
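
To make the question more concrete: what we have in mind is something like an external watchdog that looks at the node's wsrep status variables and stops the node (or takes it out of DNS) when it is no longer a healthy cluster member. Below is a minimal sketch of just the decision part in Python. It assumes the values from `SHOW GLOBAL STATUS LIKE 'wsrep_%'` have already been fetched into a dict; the function name `node_should_serve` and the strict "Synced only" policy are our own assumptions, not a Galera feature.

```python
def node_should_serve(status):
    """Decide whether a Galera node should keep accepting client traffic.

    `status` maps wsrep status variable names to their string values,
    as returned by SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    """
    return (
        # Node must belong to the primary component (the quorum side)
        status.get("wsrep_cluster_status") == "Primary"
        # Node must report that it is ready to accept queries
        and status.get("wsrep_ready") == "ON"
        # Strict policy (our assumption): only a fully synced node serves
        and status.get("wsrep_local_state_comment") == "Synced"
    )
```

A watchdog built around this would poll the node with a short query timeout and treat a timeout or a `False` result the same way. Whether stopping the service is the right reaction, and whether Galera already offers something built in for this, is exactly what we would like to know.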


