graphite-dev team mailing list archive

[Question #170794]: carbon-cache.py at its limit?

To: graphite-dev@xxxxxxxxxxxxxxxxxxx
From: Cody Stevens <question170794@xxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 11 Sep 2011 03:55:47 -0000
Reply-to: question170794@xxxxxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx

New question #170794 on Graphite:
https://answers.launchpad.net/graphite/+question/170794

I have 2 cache servers running, both accepting metrics from multiple relays. The relays are configured to send to both cache servers for data redundancy. Earlier this week I was accepting metrics from 5 of our smaller datacenters, which are just DR datacenters. The metrics received count was pretty consistently around 100k. I began adding other datacenters and could see the count jump each time a new datacenter was added. Somewhere around 300k metrics received, the relay queues filled up, which of course created a huge hole in my graphs. I figured it was due to MAX_CREATES_PER_MINUTE being set too low (50). I tweaked some of the settings to match what chrismd had mentioned he had used in another question.

Specifically I changed the following:

MAX_CACHE_SIZE = 10000000
MAX_CREATES_PER_MINUTE = 60

I also changed this to True:
LOG_UPDATES = True

Chris had mentioned that this would give an idea of what to set MAX_UPDATES_PER_SECOND to.

Currently I have:
MAX_UPDATES_PER_SECOND = 1000
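
For a rough sense of what MAX_UPDATES_PER_SECOND = 1000 implies at this volume, here is a back-of-the-envelope sketch in Python. It assumes (none of this is confirmed in this post) that the 300k figure means roughly 300k distinct metrics each reporting once per 60 seconds, that one update writes all points queued for one metric, and that the disks actually sustain the configured rate. The function names are made up for illustration.

```python
def seconds_per_full_pass(distinct_metrics: int, max_updates_per_second: int) -> float:
    """Time for the writer to touch every metric once at the configured rate limit."""
    return distinct_metrics / max_updates_per_second


def steady_state_cached_points(distinct_metrics: int,
                               max_updates_per_second: int,
                               report_interval_s: float = 60.0) -> float:
    """Approximate datapoints held in memory while the writer cycles through all metrics."""
    pass_s = seconds_per_full_pass(distinct_metrics, max_updates_per_second)
    # Each metric accumulates one point per reporting interval between writes.
    points_per_metric = max(1.0, pass_s / report_interval_s)
    return distinct_metrics * points_per_metric


if __name__ == "__main__":
    metrics = 300_000  # assumed distinct metrics (from the ~300k received figure)
    rate = 1000        # MAX_UPDATES_PER_SECOND as currently set
    print(seconds_per_full_pass(metrics, rate))       # seconds per full pass
    print(steady_state_cached_points(metrics, rate))  # compare with MAX_CACHE_SIZE
```

Under those assumptions a full pass takes about 5 minutes and roughly 1.5 million points sit in the cache, which is well under MAX_CACHE_SIZE = 10000000. If the disks cannot keep up with the configured rate, the real numbers will be worse.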

Unfortunately, there were TONS of new metrics and I figured the creates were causing the bottleneck. After letting it run for about 12 hours the graphs were still looking pretty bad (lots of holes), so I figured what the heck. The graphs were not informative at this point, so what harm could more disk I/O do? I bumped MAX_CREATES_PER_MINUTE to 600 and let 'er rip.
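
As a sanity check on why the create limit matters, the time to work through a backlog of new metrics is simple arithmetic. The sketch below assumes a hypothetical backlog of about 200k new metrics (the jump from roughly 100k to 300k); the real number of creates is not stated here, and the helper name is invented for illustration.

```python
def hours_to_create(new_metrics: int, creates_per_minute: int) -> float:
    """Hours needed to create `new_metrics` files at the given per-minute limit."""
    return new_metrics / creates_per_minute / 60.0


if __name__ == "__main__":
    backlog = 200_000  # assumed number of new metrics, not a measured value
    for rate in (50, 60, 600):
        print(f"{rate:>4} creates/min -> {hours_to_create(backlog, rate):.1f} hours")
```

With that assumed backlog, a limit of 60 per minute would need more than two days to catch up, while 600 per minute would need under six hours.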

At this point, looking at the creates log, I am down to 5 or 6 new metrics each hour, but the graphs are still missing lots of datapoints. I had thought that this might resolve itself once most of the creates had happened. Should I bump MAX_UPDATES_PER_SECOND up? I think I remember Chris saying 'less is more' in that case, but I can't find the question that answer was in. Have I hit the maximum number of metrics that can be sent to one machine? I was thinking of clustering the 2 cache servers and configuring half of the relays to go to each. Is this my only option? I thought I read that I could run multiple cache daemons on one server, listening on different ports, in order to take advantage of multiple processors. At this point the graphs have lost their usefulness, other than being able to look at them and tell that something is wrong with our graphing platform. :)
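
To make the "multiple cache daemons on different ports" idea concrete, here is a minimal Python sketch of splitting metrics across daemons by hashing the metric name, so that a given metric always lands on the same daemon. This is not carbon-relay's own routing logic; the ports, metric names, and `pick_instance` helper are all hypothetical.

```python
import hashlib


def pick_instance(metric_name: str, instances: list) -> tuple:
    """Map a metric name to one instance; the same name always maps to the same one."""
    digest = hashlib.md5(metric_name.encode("utf-8")).hexdigest()
    return instances[int(digest, 16) % len(instances)]


# Hypothetical local daemons, one per listening port.
INSTANCES = [("127.0.0.1", 2003), ("127.0.0.1", 2103)]

if __name__ == "__main__":
    for name in ("dc1.web01.cpu.user", "dc1.web01.cpu.system", "dc2.db01.load"):
        print(name, "->", pick_instance(name, INSTANCES))
```

The point of hashing on the name is that each daemon only ever writes its own subset of files, so the update load is divided rather than duplicated.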

Thanks!
Cody

--
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.