graphite-dev team mailing list archive
Message #01751
Re: [Question #178969]: Tuning Graphite for 3M points/minute with a single backend machine (a story)
Nice writeup... it is on the front page of HN now:
http://news.ycombinator.com/item?id=3243310
On Wed, Nov 16, 2011 at 8:40 AM, Ivan Pouzyrevsky <
question178969@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Question #178969 on Graphite changed:
> https://answers.launchpad.net/graphite/+question/178969
>
> Description changed to:
> Hello there!
>
> I'd like to share a story about tuning Graphite performance;
> specifically, how my colleague and I scaled Graphite to handle 3M
> points/minute (~300K distinct metrics). I'd like to hear your comments
> and suggestions on how to improve and simplify my current setup.
>
> == Background
>
> I was interested in using Graphite as a quantitative monitoring system
> in my current project. We have 1000 worker nodes and we would like to
> monitor various system metrics (such as CPU usage, memory usage, network
> throughput and so on) with relatively high time resolution (currently 1
> point in 5 seconds). Also we would like to compute some aggregate values
> like mean load average (just to be sure that aggregation works).
>
> We have a special machine which is supposed to be a storage for that
> monitoring data. It's a powerful machine with the following
> characteristics:
>
> - CPU: 4x Intel Xeon E5645 (2.4 GHz, 6 cores)
> - Memory: 48 GB (don't know the exact configuration)
> - Disks: RAID 10 with 16x 7200RPM SATAs
>
> So, we've decided to give Graphite a try.
>
> == How have I configured the clients?
>
> I've installed a custom version of Diamond by Andy Kipp. I didn't change
> the logic of Diamond; I just made a few cosmetic changes to the metrics
> and to the source code to ease (from my point of view) further
> development. You can find my fork of Diamond at
> https://github.com/sandello/Diamond/tree/master/src/collectors .
>
> Diamond is configured to collect metrics every 5 seconds and to send
> them in batches of 500 points. Each machine generates about 250
> metrics. In total, this results in 250 * (60 / 5) * 1000 = 3M datapoints
> per minute, or 3000K / 0.5K = 6K batches per minute = 100 batches per
> second. All this traffic goes directly to a single dedicated port on the
> storage01-01g machine (no load balancing at this stage).
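The arithmetic above can be checked with a few lines of Python. This is only a restatement of the figures quoted in the message; the constant names are mine.

```python
# Workload figures quoted in the message; names are illustrative.
NODES = 1000
METRICS_PER_NODE = 250
INTERVAL_SECONDS = 5
BATCH_SIZE = 500

# Each metric produces 60 / 5 = 12 points per minute.
points_per_minute = METRICS_PER_NODE * (60 // INTERVAL_SECONDS) * NODES
batches_per_minute = points_per_minute // BATCH_SIZE
batches_per_second = batches_per_minute / 60

print(points_per_minute, batches_per_minute, batches_per_second)
```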
>
> So, now I can formulate what workload Graphite has to handle.
>
> == What is the expected workload?
>
> - 1000 clients
> - 3M datapoints per minute in 6K batches per minute
> - All traffic goes to the single port on the backend
> - Graphite should compute 5 aggregated metrics; every aggregated metric
> is computed from 1000 raw metrics
>
> == First attempt
>
> Well, the very first attempt was to handle all this data with a single
> carbon-aggregator and carbon-cache. Obviously, that attempt failed.
> :) carbon-cache was CPU-bound, the aggregator queue was ever-growing and
> clients were failing to send metrics within the time limit (which is 15s
> in Diamond).
>
> So my colleague and I decided to set up about 10 carbon-caches to make
> them IO-bound and let carbon-aggregator balance (with a consistent hashing
> strategy) incoming data across those cache instances. In that case each
> carbon-cache handles about 300 metrics * 1K nodes / 10 instances =
> 30K metrics.
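For readers unfamiliar with the strategy, here is a minimal sketch of a consistent hash ring that spreads metric names across cache instances. It is not carbon's own implementation; the class name, the instance labels, the metric names and the replica count are all assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter


class HashRing:
    """Toy consistent hash ring; NOT carbon's implementation."""

    def __init__(self, nodes, replicas=100):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 32 bits of the MD5 digest, as an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric):
        # The first ring point at or after the metric's hash, wrapping around.
        idx = bisect.bisect_left(self._keys, self._hash(metric)) % len(self._keys)
        return self._ring[idx][1]


# Hypothetical instance labels and metric names.
caches = [f"cache-{i:02d}" for i in range(10)]
ring = HashRing(caches)
metrics = [
    f"servers.node{n:04d}.metric{m:03d}" for n in range(100) for m in range(250)
]
counts = Counter(ring.get_node(m) for m in metrics)
for cache, count in sorted(counts.items()):
    print(cache, count)
```

A given metric name always maps to the same instance, and removing an instance only remaps the metrics that instance owned.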
>
> After the configuration update, the carbon-caches became IO-bound (and
> that's good, actually). However, carbon-aggregator was CPU-bound and was
> not able to cope with the incoming data.
>
> So we had to do something else.
>
> == Second attempt
>
> Due to the GIL, the Python interpreter cannot use the available cores to
> distribute computations, so we have to run multiple carbon-whatever
> instances and balance traffic across them. We also have to be sure that
> carbon-aggregator receives all metrics which are subject to aggregation.
>
> We came up with the following scheme:
>
> http://cl.ly/1R3f3l3X2y150N2W2F3D
>
> - haproxy performs simple round-robin balancing across relay-top**;
> - relay-top** are used to separate metrics to aggregate from the others;
> - relay-middle** are used to ensure consistency when matching a metric
> with the appropriate carbon-cache.
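The split done by the top layer can be sketched as below. The picture is not reproduced in this archive, so the pattern, the metric names and the destination labels are assumptions. The message also does not say whether raw metrics that feed an aggregate are stored as well, so this sketch only picks a single destination per metric.

```python
import re

# Hypothetical pattern for metrics that feed an aggregate (e.g. load average).
AGGREGATED_PATTERNS = [re.compile(r"^servers\.[^.]+\.loadavg\.")]


def route(metric):
    """Return an illustrative destination label for one metric name."""
    if any(p.match(metric) for p in AGGREGATED_PATTERNS):
        # Every metric of an aggregate must reach the same aggregator.
        return "aggregator"
    # Everything else goes on to the consistent-hashing middle layer.
    return "relay-middle"


print(route("servers.node0001.loadavg.01"))
print(route("servers.node0001.cpu.user"))
```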
>
> With this scheme we were able to shift the bottleneck from CPU to IO.
> However, we became concerned with the IO pattern of carbon-cache. When
> one lucky carbon-cache instance manages to flush all its cache, it
> starts to issue a lot of small IO writes. This effectively kills system
> throughput because that particular lucky carbon-cache instance does not
> know about other cache instances which may suffer from a large in-memory
> cache.
>
> So I've implemented the following hack: for each metric, Whisper's
> update_many() method is called only if there are at least K
> points in the metric's cache. With this small hack the whole system
> started to behave in a (more) predictable manner: for a short period
> after the start, the carbon-caches do not do any updates; when the cache
> becomes large enough, the carbon-caches start to flush their caches to
> disk in large batches but (in contrast to the usual behaviour) they do
> not try to flush their caches entirely.
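A minimal sketch of the "at least K points" rule, as I read it. This is not the actual patch to carbon-cache: the class and function names are hypothetical, and write_points stands in for a call to Whisper's update_many().

```python
from collections import defaultdict


class MetricCache:
    """Toy in-memory cache; names are hypothetical, not carbon's."""

    def __init__(self, min_points):
        self.min_points = min_points  # the threshold K from the message
        self._points = defaultdict(list)

    def store(self, metric, timestamp, value):
        self._points[metric].append((timestamp, value))

    def size(self):
        return sum(len(points) for points in self._points.values())

    def drain_ready(self):
        """Yield (metric, points) only for metrics holding >= min_points."""
        ready = [m for m, pts in self._points.items() if len(pts) >= self.min_points]
        for metric in ready:
            yield metric, self._points.pop(metric)


def flush(cache, write_points):
    """Write every ready metric in one batch; return the number of points written."""
    written = 0
    for metric, points in cache.drain_ready():
        write_points(metric, points)  # stand-in for Whisper's update_many()
        written += len(points)
    return written


cache = MetricCache(min_points=3)
for t in range(2):
    cache.store("a", t, 1.0)  # below the threshold, stays cached
for t in range(3):
    cache.store("b", t, 2.0)  # reaches the threshold, gets flushed

calls = []
written = flush(cache, lambda metric, points: calls.append((metric, points)))
print(written, cache.size())
```

When flush() returns 0, a writer loop would presumably want to sleep briefly instead of polling again straight away.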
>
> The scheme is kinda complicated, but it works (c). Let me show you some
> graphs.
>
> == Statistics
>
> * carbon-cache memory usage
> http://cl.ly/110h0i1Y0c2x0m1D2f2Q
> Memory usage is relatively high (total usage is about 20G), but remains
> constant.
>
> * carbon-cache cpu usage
> http://cl.ly/2i423R3F1j1T0S1H3f1Q
> When carbon-cache flushes its cache, the usage is about 37-40%.
> Periodically usage peaks at 100% because of the ugliness of my flush hack
> (sometimes the writer's loop reduces to "while True: pass"; I'm planning
> to fix that soon).
>
> * cache size
> http://cl.ly/2g3M2t3O1D07392D0s1U
> Cache size oscillates around the setpoint. For some strange reason the red
> cache receives more metrics, hence its cache size is larger.
>
> * Received-vs-Committed points
> http://cl.ly/1t002C022Z0K2B2g3S1g
> Red -- points received by caches, purple -- points committed by caches.
> Green (equal to blue) -- points received by the middle and top layers,
> respectively.
>
> * carbon-relay cpu usage
> http://cl.ly/2G1X3x2u0Y2L250y2Z43
> Blue -- middle layer, green -- top layer. I suspect that the current
> implementation of consistent hashing is CPU-hungry, hence the difference
> in CPU usage.
>
> * carbon-relay memory usage
> http://cl.ly/3r182U3F0f42130G3k0C
> Memory usage is quite low and constant.
>
> == Ideas for further improvements
>
> * I think that my hack with a minimal batch size for a given metric can be
> greatly improved by implementing some kind of PID controller
> (http://en.wikipedia.org/wiki/PID_controller). This would allow
> carbon-cache to damp the oscillation in the number of committed points and
> adapt to various workloads. A smooth and consistent number of committed
> points is crucial for keeping the cache size bounded.
>
> * Some kind of high-performance relay would be nice. I didn't do any
> profiling, so I can't tell the exact ways to optimize relaying. However,
> I think that migrating from pickle to protobuf as the message encoding
> format would be nice, because in that case it would be quite easy to
> write a relaying daemon in C++.
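On the first idea above, a textbook discrete PID controller might look like the sketch below. It is not part of carbon. The gains, the setpoint, the incoming rate, and the choice of steering the write rate from the cache size are all assumptions for illustration.

```python
class PID:
    """Textbook discrete PID controller; gains and setpoint are illustrative."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None

    def update(self, measured, dt):
        # Positive error means the cache is above its target size.
        error = measured - self.setpoint
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


# Hypothetical loop: steer the write rate so the cache size approaches the setpoint.
pid = PID(kp=0.5, ki=0.1, kd=0.0, setpoint=100_000)
cache_size = 150_000.0
incoming = 50_000.0  # points per second, assumed constant here
for step in range(5):
    correction = pid.update(cache_size, dt=1.0)
    write_rate = max(0.0, incoming + correction)
    cache_size = max(0.0, cache_size + incoming - write_rate)
    print(step, round(write_rate), round(cache_size))
```

In a real carbon-cache the controller output could instead adjust the per-metric threshold K; which variable to steer is a design choice the message leaves open.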
>
> == Closing
>
> Thanks for your attention. Any ideas on how to simplify the current
> configuration are welcome. Any other comments and suggestions are
> welcome too.
>
> (Edit: Replaced textual scheme with the picture.)
>
> --
> You received this question notification because you are a member of
> graphite-dev, which is an answer contact for Graphite.
>
> _______________________________________________
> Mailing list: https://launchpad.net/~graphite-dev
> Post to : graphite-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~graphite-dev
> More help : https://help.launchpad.net/ListHelp
>