graphite-dev team mailing list archive
Message #06027
[Question #287525]: Consistent carbon-cache hitting MAX_CACHE_SIZE
New question #287525 on Graphite:
https://answers.launchpad.net/graphite/+question/287525
While this looks similar to https://answers.launchpad.net/graphite/+question/285063, I'm reaching out here to see if anyone has seen similar behavior or has any advice/tips.
Relevant information:
0.9.15 graphite-web/carbon/whisper
1 TB SSD storage for whisper files
8-core machine, 32 GB memory
3 relay hosts running carbon-c-relay, using forward to send to 1 server running two carbon-cache instances. On the storage server I was initially running carbon-relay, switched to carbon-c-relay using carbon_ch, and am now using any_of to split metrics between the instances (rough config sketch after this list).
~2 million metrics/minute
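
For anyone wanting to compare setups, the routing described above looks roughly like the carbon-c-relay config below (hostnames, ports, and cluster names are placeholders, not my exact config):

  # on the 3 relay hosts: forward everything to the storage server
  cluster storage
      forward
          store01:2003
      ;
  match *
      send to storage
      stop
      ;

  # on the storage server: split metrics over the local carbon-cache instances
  cluster local_caches
      any_of
          127.0.0.1:2103
          127.0.0.1:2203
      ;
  match *
      send to local_caches
      stop
      ;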
About 2 weeks ago, an influx of metrics arrived all at once, and ever since then (even after I turned off whatever was shipping that influx a few days later), carbon-cache-a's cache size has kept growing, eventually causing the machine to start swapping. I've since put in a MAX_CACHE_SIZE limit, which the instance hits within a few minutes. I've tried tuning the kernel dirty-page settings, changing the disk scheduler from deadline to noop, and turning on WHISPER_AUTOFLUSH, all without seeing any improvement.
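
For reference, these are the carbon.conf knobs involved; the values below are illustrative rather than my exact settings:

  [cache:a]
  LINE_RECEIVER_PORT = 2103
  PICKLE_RECEIVER_PORT = 2104
  CACHE_QUERY_PORT = 7102
  # cap the in-memory cache instead of letting the box swap; instance a hits this within minutes
  MAX_CACHE_SIZE = 10000000
  # whisper write throttle; I've tried values from 2500 up to 50000
  MAX_UPDATES_PER_SECOND = 10000
  # fsync whisper files on every update when enabled
  WHISPER_AUTOFLUSH = True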
Disk utilization was about 50-90%, and CPU usage was not maxed. Carbon-cache-b had no issues whatsoever. Checking carbon's committedPoints metrics, it looks like carbon-cache-a is committing about 1/4 as many points as carbon-cache-b, without the server being overloaded. Looking at carbon-c-relay's own carbon metrics, each carbon-cache instance is receiving the same number of metrics.
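
(The committed points and cache.size numbers above come from carbon's self-instrumentation under carbon.agents.*; for anyone wanting to compare, I'm graphing roughly the following render targets - the hostname is a placeholder:)

  # points actually written to whisper, per cache instance
  http://graphite.example.com/render?target=carbon.agents.*.committedPoints&from=-2hours

  # in-memory cache size, per cache instance
  http://graphite.example.com/render?target=carbon.agents.*.cache.size&from=-2hours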
Flash forward to yesterday. I've now clustered the metrics out to two servers, each running four carbon-cache instances (I was running 2 and saw the same issue; the new server was hitting 100% CPU on the caches, so I bumped both servers to 4 instances).
3 relays using carbon_ch to ship to the two servers, with each server using any_of to ship to its 4 carbon-cache instances. Now each server gets 1 million metrics/minute, and after increasing the number of carbon-cache instances on the initial server, carbon-cache-a and carbon-cache-c shoot up to MAX_CACHE_SIZE. Turning off either a or c causes b or d to grow in cache.size until I start a/c again, at which point b/d's cache.size decreases as a/c's increases again.
CPU use is ~20% for each instance, and disk utilization is 75-90%. carbon-c-relay shows the same number of metrics going to each carbon-cache.
The second new server (with newer/better SSDs) looks healthy - 100% CPU for each carbon-cache (looking at expanding that), 10-30% disk utilization, and the cache.size of each cache isn't constantly growing.
Are there any performance tips I can look into for my initial server? Changing MAX_UPDATES_PER_SECOND to anything between 2500 and 50000 doesn't seem to make a difference, and the changes mentioned above didn't help. I'm down to thinking that the upgrade to 0.9.15 introduced an issue at some point, that the SSDs are too old (the server is about 2-3 years old now), that I should switch carbon-c-relay/carbon-cache on the storage servers to UDP to reduce overhead, or that some other sysctl tuning is in order.
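
On the UDP idea, my reading of the docs is that it would amount to enabling the UDP listener per cache instance and marking the cluster members as UDP in carbon-c-relay; a rough sketch, assuming I have the option names right:

  # carbon.conf, per cache instance
  ENABLE_UDP_LISTENER = True
  UDP_RECEIVER_INTERFACE = 127.0.0.1
  UDP_RECEIVER_PORT = 2103

  # carbon-c-relay on the storage server, members switched to UDP
  cluster local_caches
      any_of
          127.0.0.1:2103 proto udp
          127.0.0.1:2203 proto udp
      ;

  # plus presumably bumping net.core.rmem_max / rmem_default for the UDP socket buffers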
Happy to post any graphs/configs that may help - the issue is odd in that the server had been running fine for months before this with about the same number of metrics.
--
You received this question notification because your team graphite-dev
is an answer contact for Graphite.