
graphite-dev team mailing list archive

Re: [Question #170794]: carbon-cache.py at its limit?

 

Question #170794 on Graphite changed:
https://answers.launchpad.net/graphite/+question/170794

    Status: Open => Answered

chrismd proposed the following answer:
Aman is right that you will see gaps if datapoints are not sent at the
intervals expected by your storage-schemas.conf, but I'll assume for the
moment that isn't the problem.
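
For reference, a minimal storage-schemas.conf stanza (illustrative values,
not taken from the original question); if your collector sends less often
than the interval in the first retention (60 seconds here), you will see
gaps in the graphs:

    # storage-schemas.conf -- hypothetical catch-all rule
    [default]
    pattern = .*
    retentions = 60s:30d    # one datapoint per minute, kept for 30 days
    # older carbon versions use seconds:datapoints instead, e.g. 60:43200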

Once you get Graphite over a few hundred thousand metrics per minute,
some extra config tuning is often required. It is also quite likely that
the severity of the performance problems was due to the creates. Once
the creates have largely stopped, performance will often continue to
suffer for a while afterwards, because creates pollute the system's I/O
buffers with useless data, causing subsequent writes to be synchronous
until the I/O buffers get repopulated with useful data. By "useful" I
mean the first and last block of each wsp file. When the first block is
cached, reading the header of each file (which is done for every write
operation) is much faster, and having the last block cached makes
graphing requests much less expensive.
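
You cannot easily see which wsp blocks are cached, but on a Linux box you
can at least watch the overall buffer/cache memory and block I/O recover
while this warm-up happens (standard tools, nothing Graphite-specific):

    free -m       # watch the "cached" column grow back
    vmstat 5      # "bi"/"bo" show block I/O settling as buffers refill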

That said, this is itself a temporary problem that should simply go away
after "a while" (anywhere from a few minutes to a couple of hours,
depending on how many wsp files you have, how much RAM you have, etc.).
Once that has passed, the most important config options for you to tune
are as follows:

Note that the goal here is to find a balance that gives you stable
performance. Watch the utilization % in iostat; your disks should be in
the 25-50% utilized range.
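
A quick way to watch that number (assuming the sysstat version of iostat
is installed):

    iostat -x 5    # extended stats every 5 seconds; watch the %util column
                   # aim for roughly 25-50% per the advice above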

If your disks are over 50% utilized, you probably want to lower
MAX_UPDATES_PER_SECOND. The default of 1000 is too high; I'd try
something in the 500-800 range (depending on how fast your storage is).
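
As a sketch, in the [cache] section of carbon.conf (600 is just an
arbitrary pick from the suggested range, not a recommendation for your
hardware):

    MAX_UPDATES_PER_SECOND = 600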

If your utilization is too low, look at carbon's CPU usage (all the
daemons involved). If carbon is CPU-bound, there are various ways to
address that.
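
One common approach, not spelled out above and so an assumption about
your setup, is to run several carbon-cache instances (one per core)
behind carbon-relay. Roughly, in carbon.conf (the ports here are
hypothetical and must not collide with your existing instance):

    [cache:b]
    LINE_RECEIVER_PORT = 2103
    PICKLE_RECEIVER_PORT = 2104
    CACHE_QUERY_PORT = 7102

and then start each instance by name:

    carbon-cache.py --instance=a start
    carbon-cache.py --instance=b start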

Your MAX_CACHE_SIZE should be at least double your "equilibrium cache
size", where equilibrium is the steady state reached after the system
has been running for a while (i.e. the I/O buffers are primed, as
described above). It is possible to set this too high, however: the
larger it is, the more memory carbon-cache can use. There is a tipping
point at which carbon-cache's size starves the system of free memory for
I/O buffers, which slows throughput down even more, which causes the
cache to keep growing until... bad things happen. This requires some
testing, as every system is different; I used a value of 10 million on a
system that had 48G of RAM. Your mileage may vary.
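
As a sketch, again in the [cache] section of carbon.conf, using the 48G
example above (size your own value at roughly double your measured
equilibrium):

    MAX_CACHE_SIZE = 10000000

To find your equilibrium, graph carbon's self-reported cache size metric
once the system has settled; the exact name depends on your hostname and
instance, something like:

    carbon.agents.<hostname>-<instance>.cache.size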

It can be tempting to raise MAX_CREATES_PER_MINUTE as you did, but the
fallout is that the system's I/O buffers can get polluted with a bunch
of brand-new wsp files, which are huge compared to datapoint updates, so
the system can quickly run out of buffer space, causing writes to become
synchronous. Once writes are synchronous, performance suffers massively,
and you can only wait until buffer space becomes available again. That
is why I suggest leaving it at a low value like 60: it is low enough
that your system can keep running and constantly be creating metrics
without hurting performance. That's the theory, anyway; it has worked
quite well for me in the past.
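
For completeness, the corresponding carbon.conf line (same [cache]
section), using the low value suggested above:

    MAX_CREATES_PER_MINUTE = 60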

-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.