[Question #178969]: Tuning Graphite for 3M points/minute with a single backend machine (a story)
New question #178969 on Graphite:
https://answers.launchpad.net/graphite/+question/178969
Hello there!
I'd like to share a story about tuning Graphite performance; specifically, how my colleague and I scaled Graphite to handle 3M points/minute (~300K distinct metrics). I'd like to hear your comments and suggestions on how to improve and simplify my current setup.
== Background
I was interested in using Graphite as a quantitative monitoring system in my current project. We have 1000 worker nodes and we would like to monitor various system metrics (such as CPU usage, memory usage, network throughput and so on) with relatively high time resolution (currently one point every 5 seconds). We would also like to compute some aggregate values, like the mean load average (just to be sure that aggregation works).
We have a special machine which is meant to serve as the storage for all that monitoring data. It's a powerful machine with the following characteristics:
- CPU: 4x Intel Xeon E5645 (2.4 GHz, 6 cores)
- Memory: 48 GB (don't know the exact configuration)
- Disks: RAID 10 with 16x 7200RPM SATAs
So, we've decided to give Graphite a try.
== How I configured the clients
I've installed a custom version of Diamond by Andy Kipp. I didn't change the logic of Diamond; I just made a few cosmetic changes to the metrics and to the source code to ease (from my point of view) further development. You can find my fork of Diamond at https://github.com/sandello/Diamond/tree/master/src/collectors .
Diamond is configured to collect metrics every 5 seconds and to send them in batches of 500 points. Each machine generates about 250 metrics. In total, this results in 250 * (60 / 5) * 1000 = 3M datapoints per minute, or 3000K / 0.5K = 6K batches per minute = 100 batches per second. All this information goes directly to the storage01-01g machine, to a single dedicated port (no load balancing at this stage).
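For the curious, here is the same back-of-the-envelope arithmetic as a tiny Python snippet (only the numbers mentioned above are used, nothing else is assumed):
<pre>
# Back-of-the-envelope workload estimate (numbers from the paragraph above).
metrics_per_node = 250    # metrics emitted by one worker node
interval_s       = 5      # collection interval in seconds
nodes            = 1000   # number of worker nodes
batch_size       = 500    # points per batch sent by Diamond

points_per_minute  = metrics_per_node * (60 // interval_s) * nodes
batches_per_minute = points_per_minute // batch_size

print(points_per_minute)          # 3000000 points per minute
print(batches_per_minute)         # 6000 batches per minute
print(batches_per_minute / 60.0)  # 100.0 batches per second
</pre>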
So now I can formulate the workload Graphite has to handle.
== What is the expected workload?
- 1000 clients
- 3M datapoints per minute in 6K batches per minute
- All traffic goes to the single port on the backend
- Graphite should compute 5 aggregated metrics; every aggregated metric is computed from 1000 raw metrics
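For concreteness, the kind of aggregate I mean is something like the toy Python sketch below: the mean load average computed from the raw per-node metrics in one time bucket (the metric names are illustrative, not my exact naming scheme):
<pre>
# One aggregation step for a single time bucket: mean load average
# computed from the ~1000 raw per-node metrics.
def mean_loadavg(bucket):
    """bucket maps metric name -> value for one 5-second interval,
    e.g. "servers.worker0001.loadavg.01" -> 3.7 (names are made up)."""
    values = [v for name, v in bucket.items() if name.endswith(".loadavg.01")]
    return sum(values) / len(values) if values else None
</pre>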
== First attempt
Well, the very first attempt was to handle all this data with a single carbon-aggregator and a single carbon-cache. Obviously, that attempt failed. :) carbon-cache was CPU-bound, the aggregator queue was ever-growing, and clients were failing to send metrics within the time limit (which is 15s in Diamond).
So my colleague and I decided to set up about 10 carbon-cache instances to make them IO-bound and let carbon-aggregator balance (with a consistent hashing strategy) the incoming data across those cache instances. In that case each carbon-cache would handle about 300 metrics * 1K nodes / 10 instances = 30K metrics.
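In case it helps to picture how metrics get pinned to cache instances, here is a minimal, simplified sketch of a consistent hash ring in Python. It is not carbon's actual implementation (the replica count and the example metric name are made up), just an illustration of the idea:
<pre>
import bisect
import hashlib

class HashRing(object):
    """Toy consistent hash ring: maps metric names to cache instances."""

    def __init__(self, instances, replicas=100):
        # Place each instance on the ring many times ("virtual nodes")
        # so that metrics spread out roughly evenly.
        self.ring = sorted(
            (self._position("%s:%d" % (instance, i)), instance)
            for instance in instances
            for i in range(replicas)
        )
        self.positions = [position for position, _ in self.ring]

    @staticmethod
    def _position(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric):
        # Walk clockwise to the first instance at or after the metric's
        # position, wrapping around at the end of the ring.
        index = bisect.bisect(self.positions, self._position(metric)) % len(self.ring)
        return self.ring[index][1]

caches = ["cache-%02d" % i for i in range(1, 11)]
ring = HashRing(caches)
print(ring.get_node("servers.worker0042.cpu.user"))  # always the same instance for this metric
</pre>
The useful property is that every process using the same ring maps a given metric name to the same cache instance.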
After the configuration update, the carbon-caches became IO-bound (and that's good, actually). However, carbon-aggregator was CPU-bound and was not able to cope with the incoming data.
So we had to do something else.
== Second attempt
Due to the GIL, the Python interpreter cannot use the available cores to distribute computation, so we have to run multiple carbon-whatever instances and balance traffic across them. We also have to be sure that carbon-aggregator receives all metrics which are subject to aggregation.
We came up with the following scheme:
<pre>
                          haproxy
                             | (round-robin)
       +----------------+----+--------+-------------+
       |                |             |             |
       v                v             v             v
  relay-top01      relay-top02       ...       relay-top10
       |\---------------|\------------|\------------|\-------------\
       v                v             v             v               |
relay-middle01   relay-middle02      ...     relay-middle10    aggregator
       |                |             |             |               |
    (%%%%%%%%%%%%%%% consistent hashing %%%%%%%%%%%%%%) <---------/
       |                |             |             |
       v                v             v             v
   cache-01         cache-02         ...        cache-10
</pre>
- haproxy performs simple round-robin balancing across the relay-top** instances,
- relay-top** instances separate the metrics that are subject to aggregation from the others (see the sketch below),
- relay-middle** instances ensure that each metric is consistently matched with the appropriate carbon-cache.
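To make the relay-top** job a bit more concrete, here is a rough sketch of the routing decision I have in mind, written in Python with a made-up regex and made-up send() helpers (this is not actual carbon-relay code):
<pre>
import re

# Hypothetical pattern for "metrics that are subject to aggregation",
# e.g. the per-node load average that feeds the mean load average.
AGGREGATED = re.compile(r"^servers\.[^.]+\.loadavg")

def route(metric, value, timestamp, middle_relay, aggregator):
    """Forward every point downstream; additionally duplicate the
    aggregation inputs to the single aggregator instance so it sees
    all the raw metrics it needs."""
    if AGGREGATED.match(metric):
        aggregator.send(metric, value, timestamp)
    # relay-middle** then uses consistent hashing to pick a carbon-cache.
    middle_relay.send(metric, value, timestamp)
</pre>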
With this scheme we were able to shift the bottleneck from CPU to IO. However, we became concerned about the IO pattern of carbon-cache. When one lucky carbon-cache instance manages to flush all of its cache, it starts to issue a lot of small IO writes. This effectively kills system throughput, because that particular lucky carbon-cache instance does not know about the other cache instances, which may be suffering from a large in-memory cache.
So I've implemented the following hack: for each metric, Whisper's update_many() method is called only if there are at least K points in the metric's cache. With this small hack the whole system started to behave in a (more) predictable manner: during a short period after the start, the carbon-caches do not do any updates; when the cache becomes large enough, the carbon-caches start to flush their caches to disk in large batches, but (in contrast to the usual behaviour) they do not try to flush their caches entirely.
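In code, the hack boils down to something like the following sketch of the writer loop (simplified; cache.drain() and path_for() are hypothetical helpers, MIN_POINTS_PER_FLUSH plays the role of K, and this is not the actual carbon source):
<pre>
import time
import whisper  # whisper.update_many() is what actually hits the disk

MIN_POINTS_PER_FLUSH = 50  # "K": only flush metrics with a big enough backlog

def writer_loop(cache, path_for):
    """Flush only those metrics whose cached backlog has reached K points,
    so that every update_many() call is one reasonably large write."""
    while True:
        flushed_any = False
        for metric, datapoints in cache.drain(min_points=MIN_POINTS_PER_FLUSH):
            # datapoints is a list of (timestamp, value) pairs
            whisper.update_many(path_for(metric), datapoints)
            flushed_any = True
        if not flushed_any:
            # Nothing has accumulated K points yet; sleep a bit instead of
            # spinning (this is the part my current hack gets wrong at times).
            time.sleep(0.1)
</pre>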
The scheme is kinda complicated, but it works (c). Let me show you some graphs.
== Statistics
* carbon-cache memory usage
http://cl.ly/110h0i1Y0c2x0m1D2f2Q
Memory usage is relatively high (total usage is about 20G), but it remains constant.
* carbon-cache cpu usage
http://cl.ly/2i423R3F1j1T0S1H3f1Q
When carbon-cache flushes its cache, the usage is about 37-40%. Periodically the usage peaks at 100% because of the ugliness of my flush hack (sometimes the writer's loop degenerates into "while True: pass"); I'm planning to fix that soon.
* cache size
http://cl.ly/2g3M2t3O1D07392D0s1U
Cache size oscillates around the setpoint. For some strange reason the red cache receives more metrics, hence its cache size is larger.
* Received-vs-Committed points
http://cl.ly/1t002C022Z0K2B2g3S1g
Red -- points received by the caches, purple -- points committed by the caches. Green (equal to blue) -- points received by the middle and top layers, respectively.
* carbon-relay cpu usage
http://cl.ly/2G1X3x2u0Y2L250y2Z43
Blue -- middle layer, green -- top layer. I suspect that the current implementation of consistent hashing is CPU-hungry, hence the difference in CPU usage.
* carbon-relay memory usage
http://cl.ly/3r182U3F0f42130G3k0C
Memory usage is quite low and constant.
== Ideas for further improvements
* I think that my hack with a minimal batch size for a given metric can be greatly improved by implementing some kind of PID controller (http://en.wikipedia.org/wiki/PID_controller); this would allow carbon-cache to damp the "number of committed points" oscillation and adapt to various workloads. A smooth and consistent number of committed points is crucial for keeping the cache size bounded. (A rough sketch of what I mean follows after this list.)
* Some kind of high-performance relay would be nice. I didn't do any profiling, so I can't tell the exact ways to optimize relaying. However, I think that migrating from pickle to protobuf as the message encoding format would be nice, because in that case it would be quite easy to write a relaying daemon in C++.
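For the PID idea, this is roughly what I have in mind (a minimal sketch; the gains, the setpoint and the get_cache_size() helper are made up, and nothing here is tuned):
<pre>
import time

class PIDController(object):
    """Textbook PID controller; its output nudges the per-metric flush
    threshold K so that the total cache size tracks a setpoint."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.previous_error = 0.0

    def update(self, measured, dt):
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def control_flush_threshold(get_cache_size, initial_k=50):
    """Once a second, adjust K based on how far the cached point count is
    from the setpoint; the writer loop would read K from here."""
    controller = PIDController(kp=0.001, ki=0.0001, kd=0.0005, setpoint=2000000)
    k = initial_k
    while True:
        # Cache above the setpoint -> negative output -> smaller K -> metrics
        # get flushed sooner, so the cache drains back towards the setpoint.
        k = max(1, k + controller.update(get_cache_size(), dt=1.0))
        time.sleep(1.0)
</pre>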
== Closing
Thanks for your attention. Any ideas on how to simplify current configuration are welcome. Any other comments and suggestions are welcome too.