graphite-dev team mailing list archive

[Question #240674]: Aggregation Spikiness and Relay Receive/Send Metric Gaps

 

New question #240674 on Graphite:
https://answers.launchpad.net/graphite/+question/240674

I am trying to set up a Graphite cluster capable of handling 500K metric datapoints every 10 seconds, as a starting point. After reading through some of the answers on this site, blog posts and other documentation, I have set up the following configuration:

- 2 machines with 8 cores, 32 GB of memory and 3 TB of storage each
- On each machine:
     - 5 carbon-relays
     - 9 carbon-aggregators
     - 9 carbon-caches
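
A rough back-of-the-envelope sketch of the load each process would see at the 500K target. It assumes traffic is spread evenly across processes, which is an assumption and not something measured; the constant and function names are illustrative only.

```python
# Back-of-the-envelope load per process, ASSUMING an even spread of traffic.
DATAPOINTS_PER_INTERVAL = 500_000  # target stated in the question
INTERVAL_SECONDS = 10
HOSTS = 2
RELAYS_PER_HOST = 5
AGGREGATORS_PER_HOST = 9


def per_second(total, interval):
    """Convert a per-interval count into a per-second rate."""
    return total / interval


def per_process(rate, processes):
    """Split a rate evenly across a number of processes."""
    return rate / processes


cluster_rate = per_second(DATAPOINTS_PER_INTERVAL, INTERVAL_SECONDS)
relay_rate = per_process(cluster_rate, HOSTS * RELAYS_PER_HOST)
aggregator_rate = per_process(cluster_rate, HOSTS * AGGREGATORS_PER_HOST)

print(f"cluster:        {cluster_rate:.0f} datapoints/s")
print(f"per relay:      {relay_rate:.0f} datapoints/s")
print(f"per aggregator: {aggregator_rate:.0f} datapoints/s")
```

Under that even-spread assumption, the target works out to 50,000 datapoints per second for the cluster, 5,000 per relay and roughly 2,800 per aggregator.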

In total, there are 10 relays, 18 aggregators and 18 caches in the cluster. Each aggregator communicates with a single cache (a 1-to-1 mapping). The webapps are configured to talk to the caches on their own host. An haproxy load balancer receives all the metric traffic and distributes the load among the 10 relays. All 18 aggregators are listed as destinations in each relay's configuration file. The relays are configured with aggregated-consistent-hashing so that all metrics feeding the same aggregate, as determined by the aggregation rules, end up in the same cache.
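
To make the routing idea concrete, here is a simplified sketch of hashing on the aggregate name instead of the raw metric name. It is not carbon's actual implementation: the rule, the metric names, the destination labels and the plain modulo hashing are all hypothetical and chosen for brevity.

```python
import hashlib
import re

# Hypothetical rule: every host's request count feeds one summed aggregate.
# The (pattern, output name) pair is illustrative, not taken from the question.
RULES = [
    (re.compile(r"^servers\.[^.]+\.requests$"), "servers.all.requests"),
]

# 18 hypothetical aggregator destinations (2 hosts x 9 aggregators).
DESTINATIONS = [f"host{h}:aggregator{i}" for h in (1, 2) for i in range(9)]


def routing_key(metric):
    """Return the aggregate name if a rule matches, else the metric itself."""
    for pattern, output in RULES:
        if pattern.match(metric):
            return output
    return metric


def pick_destination(metric):
    """Hash the routing key and map it onto one destination.

    Plain modulo hashing is used for brevity; a real consistent-hash ring
    behaves differently when destinations are added or removed.
    """
    digest = hashlib.md5(routing_key(metric).encode("utf-8")).hexdigest()
    return DESTINATIONS[int(digest, 16) % len(DESTINATIONS)]


# Both inputs share a routing key, so they land on the same aggregator.
print(pick_destination("servers.web01.requests"))
print(pick_destination("servers.web02.requests"))
```

The point of the sketch is only that inputs to the same aggregate share a routing key, so a sum rule sees all of its inputs in one process.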

This setup behaves well. I have been able to run stress tests on the cluster, publishing larger sets of metrics incrementally to monitor the cluster health at every point. However, I have noticed that there are issues with the aggregated metrics.

For example, in the screenshot linked below, the graph on the right shows the raw values received. The graph on the left shows the aggregated values computed from the raw values. In this case, this metric's aggregation is defined as a sum in the aggregation rules configuration file.

http://bit.ly/1hO6bBQ
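
For a metric aggregated as a sum, the expected value for each interval can be recomputed from the raw datapoints and compared with what the aggregator wrote. A minimal sketch, assuming a 10-second interval; the sample timestamps and values are made up for illustration and are not taken from the graphs.

```python
from collections import defaultdict


def sum_by_interval(datapoints, interval=10):
    """Sum raw (timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(float)
    for timestamp, value in datapoints:
        # Align each timestamp to the start of its bucket.
        bucket_start = timestamp - (timestamp % interval)
        buckets[bucket_start] += value
    return dict(sorted(buckets.items()))


# Made-up sample data: three sources reporting in two consecutive intervals.
raw = [
    (1000, 250.0), (1003, 250.0), (1007, 250.0),
    (1010, 240.0), (1012, 260.0), (1019, 250.0),
]
expected = sum_by_interval(raw)
print(expected)
```

Any interval where the stored aggregate differs from the recomputed sum points at datapoints that were missing, late, or counted in a different bucket.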

If I do the sum by hand, the result is a value around 750, which is clearly not what Graphite is computing. This happens for *all* aggregated metrics in my cluster. While investigating this issue, I also noticed something strange when comparing the number of metrics received by the relays against the number of metrics the relays sent to the aggregators. In the screenshot below, the graph on the right shows that the relays received around 280K metrics. However, only around 140K of those were sent to the aggregators.

http://bit.ly/1f91Q85

If I enable whitelists and reduce the number of metrics processed by the cluster, the aggregations start functioning properly again and the relays' received and sent metric counts match again. See the screenshots linked below:

http://bit.ly/1aXJyST
http://bit.ly/18EP28u

Questions:

- Any insight into why the aggregated metrics are "spiky" while the corresponding raw values look correct?
- Is there a scenario in which a relay will send fewer metrics than it receives?
- Does my setup make sense? Is there a better way to scale a Graphite cluster?

I would greatly appreciate any help.

Thanks!
