graphite-dev team mailing list archive

Re: [Question #145032]: High CPU utilization on graphite server

To: graphite-dev@xxxxxxxxxxxxxxxxxxx
From: chrismd <question145032@xxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 18 Feb 2011 01:42:29 -0000
Reply-to: question145032@xxxxxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx
Question #145032 on Graphite changed:
https://answers.launchpad.net/graphite/+question/145032

    Status: Open => Answered

chrismd proposed the following answer:
I'm not terribly familiar with the specs for AWS instances, so it's
hard to say, but Graphite typically reaches I/O bottlenecks long before
CPU bottlenecks.

The carbon metrics don't have the most descriptive names so I'll give
you a quick run-down of what they all mean, but first I need to explain
a little bit about what carbon actually does.

First, it receives datapoints from clients. I have the bad habit of
mixing the terms "metrics" and "datapoints" (aka "points"). The
'metricsReceived' metric tells you how many (metric, datapoint) pairs
carbon receives each minute.

Once received, each datapoint is put in a queue associated with its
metric. Carbon has a separate writer thread that iterates over all the
queues and writes them to disk. The collection of all the queues is often
referred to as "the cache" because it is queried by the webapp whenever
a graph is requested. Since the queues serve the purpose of temporary
storage for pending writes they could also be described as buffers. Take
your pick of terminology, they're multi-purpose data structures :)
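
To make the "cache of queues" idea concrete, here is a minimal sketch of
such a structure. It is a toy model and not carbon's actual code: the
class name, method names, and metric names are all made up for
illustration.

```python
from collections import defaultdict


class MetricCache:
    """Toy model: one queue of pending datapoints per metric (hypothetical)."""

    def __init__(self):
        # metric name -> list of (timestamp, value) waiting to be written
        self.queues = defaultdict(list)

    def store(self, metric, datapoint):
        # Called when a datapoint is received from a client.
        self.queues[metric].append(datapoint)

    def query(self, metric):
        # Called by the webapp so graphs can include not-yet-written points.
        # Return a copy so the caller cannot mutate the pending queue.
        return list(self.queues.get(metric, []))

    def drain(self, metric):
        # Called by the writer: remove the whole queue so it can be
        # written to disk in a single update operation.
        return self.queues.pop(metric, [])

    @property
    def size(self):
        # Total number of datapoints waiting across all queues.
        return sum(len(q) for q in self.queues.values())


cache = MetricCache()
cache.store("servers.web01.load", (1297993349, 0.42))  # made-up metric names
cache.store("servers.web01.load", (1297993409, 0.40))
cache.store("servers.web02.load", (1297993349, 1.10))
```

In this model, len(cache.queues) plays the role of cache.queues and
cache.size plays the role of cache.size in the run-down below.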

Anyways, here's the run-down:

cache.queries - the number of queries made against "the cache".

cache.queues - the number of queues in the cache, which logically
corresponds to the number of distinct metrics that have datapoints
waiting to be written.

cache.size - the sum total of the sizes of all the queues (the number of
datapoints in "the cache").

metricsReceived - the number of (metric, datapoint) pairs received by
carbon.

cpuUsage - carbon's own measurement of its user + system cpu time.

creates - the number of new metrics (new wsp files) created each minute;
this is typically 0.

errors - a quantitative measurement of bad joo-joo.

updateOperations - as the writer thread iterates over all the queues in
the cache, it takes a queue and writes all of its datapoints to a wsp file
in a single update operation. This measures the number of update
operations occurring each minute. Note that some updates may be a single
datapoint while others may involve many datapoints, depending on how
much data is in the queues.

pointsPerUpdate - the average number of datapoints written in each
update during the minute.

avgUpdateTime - the average time each update operation takes. In my
youthful stupidity I chose to measure this in seconds, thus the values
are typically extremely small... Likely to change to microseconds in the
future.

committedPoints - the total number of datapoints written each minute.
Generally this should be equal to updateOperations times
pointsPerUpdate.
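
As a rough illustration of how the write-side numbers relate to each
other, the sketch below derives them from a list of updates performed in
one minute. The function name and the sample values are assumptions for
the example; carbon computes these internally in its own way.

```python
def summarize_updates(update_sizes, update_seconds):
    """Derive the write-side counters for one minute (illustrative only).

    update_sizes[i]   = number of datapoints written by update i
    update_seconds[i] = wall-clock duration of update i, in seconds
    """
    update_operations = len(update_sizes)
    committed_points = sum(update_sizes)
    if update_operations:
        points_per_update = committed_points / update_operations
        avg_update_time = sum(update_seconds) / update_operations
    else:
        points_per_update = 0.0
        avg_update_time = 0.0
    return {
        "updateOperations": update_operations,
        "committedPoints": committed_points,
        "pointsPerUpdate": points_per_update,
        "avgUpdateTime": avg_update_time,
    }


# Three updates that wrote 1, 4 and 7 datapoints (made-up numbers).
stats = summarize_updates([1, 4, 7], [0.0002, 0.0003, 0.0004])
```

Here committedPoints is 12 and pointsPerUpdate is 4.0, so
updateOperations (3) times pointsPerUpdate gives committedPoints back.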

The reason metricsReceived is usually equal to committedPoints is that
Graphite tends to reach an equilibrium once the cache grows to a
certain size. The larger the cache, the larger the pointsPerUpdate and
thus the larger the committedPoints. The committedPoints will generally
grow until it equals metricsReceived.
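
The equilibrium can be seen in a small simulation. This is a simplified
model under stated assumptions, not a description of carbon's real
scheduling: every metric receives exactly one datapoint per minute, the
writer can perform only a fixed number of update operations per minute
(standing in for the I/O limit), and it visits the queues round-robin,
draining each queue completely in one update.

```python
def simulate(num_metrics, updates_per_minute, minutes):
    """Return a list of (received, committed, cache_size), one per minute."""
    queues = [0] * num_metrics  # pending datapoints per metric
    cursor = 0                  # next queue the writer will visit
    history = []
    for _minute in range(minutes):
        # Receive phase: one new datapoint for every metric.
        for i in range(num_metrics):
            queues[i] += 1
        received = num_metrics

        # Write phase: a limited number of updates, each draining one queue.
        committed = 0
        for _update in range(updates_per_minute):
            committed += queues[cursor]
            queues[cursor] = 0
            cursor = (cursor + 1) % num_metrics

        history.append((received, committed, sum(queues)))
    return history


# 1000 metrics, but the disk only sustains 250 updates per minute.
history = simulate(num_metrics=1000, updates_per_minute=250, minutes=6)
for received, committed, cache_size in history:
    print(received, committed, cache_size)
```

In this model committedPoints climbs 250, 500, 750 and then settles at
1000, matching metricsReceived, while the cache levels off at 1500
datapoints. The writer never gets faster in terms of update operations;
it catches up only because each update carries more points.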

I hope that helps. Back to your CPU issue: I would be curious to see
graphs of these metrics for your system in order to help you diagnose
what is happening.
