
maria-discuss team mailing list archive

Re: MariaDB server horribly slow on start


On Thu, Jul 28, 2022 at 12:07 PM Cédric Counotte
<cedric.counotte@xxxxxxxxxx> wrote:
> Well, one server crashed twice a few days ago and I've asked my service provider (OVH) to look into it, but they asked me to test the hardware myself. I found an NVMe disk with 17000+ errors, still waiting for their feedback on this.

It sounds like you need:
1) ZFS
2) Better monitoring

> Only our 2 oldest servers are experiencing crashes (they are only 6 months old!), and it turns out the NVMe disks in the RAID have very different written-data totals: one disk has 58TB (not a replacement) while the other is at 400+TB within the same RAID! All other servers have identical written-data sizes on both disks of their RAID, so it seems we got used disks and those are having issues.

Welcome to the cloud. But this is not a bad thing, it's better than
having multiple disks in the same array fail at the same time.
ZFS would help you by catching those errors before the database
ingests them. In normal non-ZFS RAID, it is plausible and even quite
probable that the corrupted data will be loaded from disk and
propagate to other nodes, either via a state transfer or via corrupted
replication events.
ZFS prevents that by comparing every block's checksum at read time, so
any errors that show up get recovered from the other redundant disks.
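As a conceptual sketch (not ZFS's actual on-disk implementation), the read path works roughly like this: every block is stored with a checksum recorded at write time, the checksum is re-verified on every read, and a mismatch triggers recovery from a redundant copy plus an in-place repair:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two redundant copies of the same block (a mirror), plus the checksum
# recorded when the block was originally written.
good = b"database page contents"
stored_sum = checksum(good)
mirror = {"disk0": bytearray(good), "disk1": bytearray(good)}

# Simulate silent corruption on one disk.
mirror["disk0"][0] ^= 0xFF

def read_block() -> bytes:
    """Return the block, verifying the checksum and self-healing if needed."""
    for disk, copy in mirror.items():
        data = bytes(copy)
        if checksum(data) == stored_sum:
            # Repair any redundant copies that failed verification.
            for other in mirror:
                if checksum(bytes(mirror[other])) != stored_sum:
                    mirror[other] = bytearray(data)
            return data
    raise IOError("all redundant copies failed checksum verification")

assert read_block() == good            # corruption never reaches the caller
assert bytes(mirror["disk0"]) == good  # and the bad copy has been repaired
```

The key point for this thread: without the checksum, the mirror would happily return whichever copy it read first, corrupted or not.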

Under the current circumstances, I wouldn't trust your data integrity
until you run a full extended table check on all tables on all nodes.
And probably run pt-table-checksum on all the tables across the nodes to make sure they match.
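The core idea behind pt-table-checksum can be sketched like this (a toy illustration, not the tool itself, which chunks rows and computes CRC32/MD5 aggregates inside SQL): compute a deterministic checksum of each table's rows on every node and compare the results; any node whose checksum differs holds divergent or corrupted data:

```python
import hashlib

def table_checksum(rows):
    """Deterministic checksum over a table's rows (sorted so row order
    on disk doesn't matter for this toy comparison)."""
    h = hashlib.md5()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

# Hypothetical row sets as seen on three cluster nodes.
node_tables = {
    "node1": [(1, "alice"), (2, "bob")],
    "node2": [(1, "alice"), (2, "bob")],
    "node3": [(1, "alice"), (2, "b0b")],  # silently corrupted row
}

sums = {node: table_checksum(rows) for node, rows in node_tables.items()}
reference = sums["node1"]
divergent = [node for node, s in sums.items() if s != reference]
# divergent == ["node3"] -> that node needs to be resynced
```

In practice you would run the real tool against the cluster rather than roll your own, but this is the comparison it is performing for you.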

> Still didn't have time to produce a crash dump and post an issue with those (to confirm the cause) as I kept having to deal with server restarts trying to reduce the slow issue for 30 minutes to one hour.

You need to be careful with that - state transfer from a node with
failing disks can actually result in the corrupted data propagating to
the node being bootstrapped.

> There were issues with the slave thread crashing, which I posted an issue about and resolved by updating MariaDB; still, there are issues with slave threads stopping without reason, so I have written a script to restart them and posted an issue about that.

I don't think you can meaningfully debug anything until you have
verified that your hardware is reliable.
Do your OVH servers have ECC memory?

> The original objective was to have 2 usable clusters in different sites, synced with each other using replication, however all those issues have not allowed us to move forward with this.

With 4 nodes across 2 DCs, you are going to lose writability if you
lose a DC even if it is the secondary DC.
Your writes are also going to be very slow because with 4 nodes, all
writes have to be acknowledged by 3 nodes - and the 3rd node is always
going to be slow because it is connected over a WAN.
I would seriously question whether Galera is the correct solution for you.
And writing to multiple nodes at once will make things far worse on
top of that.

> Not to mention the fact that we are now using the OVH load balancer, and that piece of hardware sometimes thinks all our servers are down and starts showing error 503 to our customers while our servers are running just fine (no restart, no issue, nothing). So one more issue to deal with, for which we'll get a dedicated server and configure our own load balancer that we have control over.

I think you need to take a long hard look at what you are trying to
achieve and re-assess:
1) Whether it is actually achievable sensibly within the constraints you imposed
2) What the best workable compromise is between what you want and what
you can reasonably have

Right now, I don't think you have a solution that is likely to be workable.
