graphite-dev team mailing list archive

Re: [Question #178969]: Tuning Graphite for 3M points/second with a single backend machine (a story)

 

Question #178969 on Graphite changed:
https://answers.launchpad.net/graphite/+question/178969

Description changed to:
Hello there!

I'd like to share a story about tuning Graphite performance;
specifically, how my colleague and I scaled Graphite to handle 3M
points/minute (~300K distinct metrics). I'd like to hear your comments
and suggestions on how to improve and simplify my current setup.

== Background

I was interested in using Graphite as a quantitative monitoring system
in my current project. We have 1000 worker nodes and we would like to
monitor various system metrics (such as CPU usage, memory usage, network
throughput and so on) with relatively high time resolution (currently
one point every 5 seconds). We would also like to compute some aggregate
values, such as the mean load average (just to be sure that aggregation works).

We have a special machine which is supposed to be a storage for that
monitoring data. It's a powerful machine with the following
characteristics:

 - CPU: 4x Intel Xeon E5645 (2.4 GHz, 6 cores)
 - Memory: 48 GB (I don't know the exact configuration)
 - Disks: RAID 10 with 16x 7200 RPM SATA drives

So, we've decided to give Graphite a try.

== How are the clients configured?

I've installed a custom version of Diamond by Andy Kipp. I didn't change
Diamond's logic; I just made a few cosmetic changes to the metrics
and to the source code to ease (from my point of view) further
development. You can find my fork of Diamond at
https://github.com/sandello/Diamond/tree/master/src/collectors .

Diamond is configured to collect metrics every 5 seconds and to send
them in batches of 500 points. Each machine generates about 250
metrics. In total, this results in 250 * (60 / 5) * 1000 = 3M datapoints
per minute, or 3000K / 0.5K = 6K batches per minute = 100 batches per
second. All this data goes directly to the storage01-01g machine, to a
single dedicated port (no load balancing at this stage).

So, now I can formulate the workload Graphite has to handle.

== What is the expected workload?

 - 1000 clients
 - 3M datapoints per minute in 6K batches per minute
 - All traffic goes to the single port on the backend
 - Graphite should compute 5 aggregated metrics; every aggregated metric is computed from 1000 raw metrics

== First attempt

Well, the very first attempt was to handle all this data with a single
carbon-aggregator and a single carbon-cache. Obviously, that attempt failed.
:) carbon-cache was CPU-bound, the aggregator queue was ever-growing and
clients were failing to send metrics within the time limit (which is 15s in
Diamond).

So my colleague and I decided to set up about 10 carbon-cache instances to
make them IO-bound and to let carbon-aggregator balance (with the
consistent hashing strategy) the incoming data across those cache
instances. In that case each carbon-cache handles about 300 metrics * 1K
nodes / 10 instances = 30K metrics.
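
For readers unfamiliar with the term, here is a minimal sketch of a
consistent hash ring in Python. It only illustrates the idea (map each
metric name to one of the cache instances so that adding or removing an
instance remaps only a small fraction of the metrics); it is not carbon's
actual implementation, and the instance names are made up.

    import bisect
    import hashlib

    class ConsistentHashRing(object):
        def __init__(self, nodes, replicas=100):
            # The ring is a sorted list of (position, node); each node gets
            # several virtual positions to smooth the distribution.
            self.ring = []
            for node in nodes:
                for i in range(replicas):
                    pos = self._hash('%s:%d' % (node, i))
                    bisect.insort(self.ring, (pos, node))

        def _hash(self, key):
            return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

        def get_node(self, metric):
            # Walk clockwise from the metric's position to the next node.
            pos = self._hash(metric)
            index = bisect.bisect_left(self.ring, (pos, '')) % len(self.ring)
            return self.ring[index][1]

    ring = ConsistentHashRing(['cache-%02d' % i for i in range(10)])
    print(ring.get_node('servers.worker042.cpu.user'))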

After the configuration update the carbon-cache instances became IO-bound
(and that's good, actually). However, carbon-aggregator was CPU-bound and
was not able to cope with the incoming data.

So we had to do something else.

== Second attempt

Because of the GIL, the Python interpreter cannot use the available cores
to distribute computation, so we have to run multiple carbon-whatever
instances and balance traffic across them. We also have to be sure that
carbon-aggregator receives all the metrics which are subject to aggregation.

We came up with the following scheme:

http://cl.ly/1R3f3l3X2y150N2W2F3D

- haproxy performs simple round-robin balancing across the relay-top** instances,
- relay-top** instances separate the metrics that need aggregation from the others (see the sketch below),
- relay-middle** instances ensure that each metric is consistently matched with the appropriate carbon-cache instance.
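
To make the relay-top role concrete, here is a sketch of the decision a
relay-top instance makes for each incoming metric. This is my own Python
illustration rather than carbon's relay code; the aggregation pattern,
hostnames and ports are assumptions.

    import itertools
    import re

    # Example pattern for metrics that must reach carbon-aggregator
    # (e.g. the per-node load average we want to average over all nodes).
    AGGREGATED = re.compile(r'^servers\.[^.]+\.loadavg\b')

    AGGREGATOR = ('127.0.0.1', 2023)
    MIDDLE_RELAYS = [('127.0.0.1', 2103 + i) for i in range(4)]
    _next_middle = itertools.cycle(MIDDLE_RELAYS)

    def route(metric):
        if AGGREGATED.match(metric):
            # Metrics subject to aggregation go to the single aggregator so
            # that it sees every input it needs.
            return [AGGREGATOR]
        # Everything else continues to the middle layer; the consistent
        # metric-to-cache mapping happens there, so round-robin is fine here.
        return [next(_next_middle)]

    print(route('servers.worker042.loadavg.01'))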

With this scheme we were able to shift the bottleneck from CPU to IO.
However, we became concerned about the IO pattern of carbon-cache. When
one lucky carbon-cache instance manages to flush its entire cache, it
starts to issue a lot of small IO writes. This effectively kills system
throughput, because that lucky instance does not know about the other
cache instances, which may be struggling with a large in-memory cache.

So I've implemented the following hack: for each metric, Whisper's
update_many() method is called only if there are at least K points in
that metric's cache. With this small hack the whole system started to
behave in a (more) predictable manner: for a short period after startup
the carbon-caches do not perform any updates; once a cache becomes large
enough, the carbon-caches start flushing it to disk in large batches,
but (in contrast to the usual behaviour) they do not try to drain their
caches entirely.
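
Roughly speaking, the hack amounts to something like the following (a
simplified sketch of the idea rather than the actual patch; the threshold
value and the metric_to_path() helper are made up):

    import whisper

    MIN_POINTS_PER_FLUSH = 50    # the "K" above; the actual value is tuned per setup

    def write_cached_datapoints(metric_cache, metric_to_path):
        # One pass of the writer loop: flush a metric only when it has
        # accumulated a reasonably large batch, so every update_many() call
        # is one big write instead of many tiny ones, and the cache is
        # never drained completely.
        for metric, datapoints in metric_cache.items():
            if len(datapoints) < MIN_POINTS_PER_FLUSH:
                continue
            # datapoints is a list of (timestamp, value) pairs.
            whisper.update_many(metric_to_path(metric), datapoints)
            metric_cache[metric] = []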

The scheme is kind of complicated, but it works (c). Let me show you some
graphs.

== Statistics

* carbon-cache memory usage
  http://cl.ly/110h0i1Y0c2x0m1D2f2Q
  Memory usage is relatively high (total usage is about 20G), but remains constant.

* carbon-cache cpu usage
  http://cl.ly/2i423R3F1j1T0S1H3f1Q
  When carbon-cache flushes its cache, CPU usage is about 37-40%. Periodically the usage peaks at 100% because of the ugliness of my flush hack (sometimes the writer's loop degenerates into "while True: pass"; I'm planning to fix that soon).

* cache size
  http://cl.ly/2g3M2t3O1D07392D0s1U
  Cache size oscillates around the setpoint. For some strange reason the red cache receives more metrics, hence its cache size is larger.
  
* Received-vs-Committed points
  http://cl.ly/1t002C022Z0K2B2g3S1g
  Red -- points received by the caches, purple -- points committed by the caches. Green (equal to blue) -- points received by the middle and top layers respectively.
  
* carbon-relay cpu usage
  http://cl.ly/2G1X3x2u0Y2L250y2Z43
  Blue -- middle layer, green -- top layer. I suspect that the current implementation of consistent hashing is CPU-hungry, hence the difference in CPU usage.
  
* carbon-relay memory usage
  http://cl.ly/3r182U3F0f42130G3k0C
  Memory usage is quite low and constant.

== Ideas for further improvements

* I think that my hack with a minimal per-metric batch size can be
greatly improved by implementing some kind of PID controller
(http://en.wikipedia.org/wiki/PID_controller). This would allow carbon-
cache to damp the oscillation in the number of committed points and to
adapt to various workloads. A smooth and consistent number of committed
points is crucial for keeping the cache size bounded.
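
To sketch what I mean (a textbook PID loop in Python; the gains, the
setpoint and the way the output is mapped to the flush threshold K are
untested assumptions, not a working design):

    class PIDController(object):
        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.integral = 0.0
            self.prev_error = 0.0

        def update(self, measured, dt):
            # Standard PID: proportional + integral + derivative of the error.
            error = measured - self.setpoint
            self.integral += error * dt
            derivative = (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    # Controlled variable: total number of cached datapoints; output: how
    # much to lower (or raise) the per-metric flush threshold K next pass.
    pid = PIDController(kp=0.0001, ki=0.00001, kd=0.0, setpoint=2000000)
    BASE_K = 50
    cache_size = 2500000                 # example reading of the cache size
    K = max(1, int(BASE_K - pid.update(cache_size, dt=5.0)))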

* Some kind of high-performance relay would be nice. I didn't do any
profiling, so I can't tell the exact ways to optimize relaying. However,
I think that migrating from pickle to protobuf as the message encoding
format would be nice, because in that case it would be quite easy to
write a relaying daemon in C++.
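
For reference, as far as I understand it the relays currently exchange
length-prefixed pickles of datapoint lists, roughly as in the sketch
below; any replacement encoding (protobuf or anything else) would need to
carry the same information.

    import pickle
    import struct

    def encode_batch(datapoints):
        # datapoints: list of (metric, (timestamp, value)) tuples.
        payload = pickle.dumps(datapoints, protocol=2)
        header = struct.pack('!L', len(payload))   # 4-byte big-endian length
        return header + payload

    message = encode_batch([('servers.worker042.cpu.user', (1323456789, 12.5))])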

== Closing

Thanks for your attention. Any ideas on how to simplify the current
configuration are welcome, as are any other comments and suggestions.

(Edit: Replaced textual scheme with the picture.)

-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.