graphite-dev team mailing list archive

Re: [Question #133636]: queue size - writes don't keep up with volume

 

Question #133636 on Graphite changed:
https://answers.launchpad.net/graphite/+question/133636

    Status: Open => Answered

chrismd proposed the following answer:
Hi Bryan. Each time an application like carbon writes data to disk, the
data doesn't actually go to disk immediately; the kernel puts it in a
buffer and flushes it to disk later on. The kernel does this for
efficiency, and it results in very low write latency as long as there is
free memory for the kernel to allocate buffers in. This matters so much
to carbon because carbon depends on I/O latency staying very low. If the
I/O latency increases significantly, the rate at which carbon writes
data drops dramatically, causing the cache to grow and carbon to start
dropping datapoints (except in your case, carbon will simply crash
because your MAX_CACHE_SIZE is infinite).
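
As an aside, if you want to see this buffering in action on a Linux box,
the kernel reports how much written-but-not-yet-flushed data it is
holding in /proc/meminfo. A quick sketch (purely illustrative,
Linux-only) of what to watch while carbon is under load:

import time

# print how much data the kernel is buffering (Dirty) and actively
# flushing to disk (Writeback), every 5 seconds
while True:
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(('Dirty:', 'Writeback:')):
                print(line.strip())
    print('---')
    time.sleep(5)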

Carbon needs to keep writing data to disk very quickly, otherwise the
influx of new data will overwhelm it. What is probably happening is that
your Graphite server's kernel runs out of free memory to use for
buffering, and when that happens carbon's writes become synchronous.
Disks are obviously much slower than memory, even on a fast SAN, and the
jump in write latency (which you can monitor with the
carbon.agents.<server>.avgUpdateTime metric) causes the cache to start
growing. What you will probably see if you look at the history of this
metric alongside the cache size is that avgUpdateTime stays fairly low,
then suddenly jumps much higher; at the same time the cache starts
growing until it hits a critical point and the app crashes or simply
stops working.
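
For example, a render URL along these lines will graph the two together
over the last day (I'm assuming the cache size metric is named
cache.size under carbon.agents.<server>; the exact name can differ
between carbon versions):

http://your.graphite.host/render?from=-24hours&target=carbon.agents.yourserver.avgUpdateTime&target=carbon.agents.yourserver.cache.size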

There are many ways to solve this problem, but in your case the easiest
would be some carbon.conf changes:

1) Don't use MAX_CACHE_SIZE = inf; it just means the inevitable outcome
of an I/O latency problem will be a crash. If you put a limit on the
cache size (I used 10 million on machines with 24G of memory) then the
outcome will be sporadic dropped datapoints until the latency comes back
down and the cache drops below the limit.

2) MAX_UPDATES_PER_SECOND is waaaaay too high. I'll post the comment from carbon.conf.example that explains why:
# Limits the number of whisper update_many() calls per second, which effectively
# means the number of write requests sent to the disk. This is intended to
# prevent over-utilizing the disk and thus starving the rest of the system.
# When the rate of required updates exceeds this, then carbon's caching will
# take effect and increase the overall throughput accordingly.

The idea is to actually slow down the rate of write calls to avoid
causing I/O cache starvation. It may sound counter-intuitive but it
helps in practice. Essentially it lets you strike a balance between the
use of carbon's caching mechanism vs the kernel's buffering mechanism.
You should look at your updates.log to see how many updates are done per
second on your systems, set your value to about 80% of that. Experiment
with this to see what works for you. I use a value of 800.
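
If you want to automate that check, here's a rough sketch; I'm assuming
each completed update is logged on its own line beginning with a date
and a time, so adjust the field handling to whatever your carbon version
actually writes:

import sys
from collections import defaultdict

# tally how many update lines were logged during each second, assuming the
# first two whitespace-separated fields on a line are the date and the time
counts = defaultdict(int)
for line in open(sys.argv[1]):   # e.g. your carbon-cache updates.log
    fields = line.split()
    if len(fields) >= 2:
        counts[(fields[0], fields[1])] += 1

rates = sorted(counts.values())
if rates:
    print('median updates/sec: %d' % rates[len(rates) // 2])
    print('peak updates/sec:   %d' % rates[-1])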

3) MAX_CREATES_PER_MINUTE = inf is the culprit for your problems. Each time carbon receives a new metric it has to allocate a new whisper file, which can be a few megs depending on your configuration. That creation is a big write the kernel buffers just like every other write, except that it takes up a lot of room. When hundreds of new metrics arrive, that means hundreds of new files being allocated and hundreds of big writes filling up the available buffers, leaving no room to buffer the updates to existing metrics. Carbon's writes then become synchronous, and carbon goes boom. Here is the warning from carbon.conf:
# Softly limits the number of whisper files that get created each minute.
# Setting this value low (like at 50) is a good way to ensure your graphite
# system will not be adversely impacted when a bunch of new metrics are
# sent to it. The trade off is that it will take much longer for those metrics'
# database files to all get created and thus longer until the data becomes usable.
# Setting this value high (like "inf" for infinity) will cause graphite to create
# the files quickly but at the risk of slowing I/O down considerably for a while.

I use a value of 60. It is probably very bad of me to have a default
value of 'inf' in the example config file, sorry about that :)
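
To put a rough number on the "few megs" I mentioned: whisper stores each
datapoint as 12 bytes (a 4-byte timestamp plus an 8-byte value), so a
file's size is essentially the number of slots across all of its
retention archives times 12, plus a small header. For example:

# rough size of a whisper file that keeps one-minute data for a year
points = 60 * 24 * 365          # 525,600 slots in the archive
size_mb = points * 12 / 1e6     # 12 bytes per datapoint, header ignored
print('%.1f MB' % size_mb)      # roughly 6.3 MB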

Note that if you get large sets of new metrics frequently, it is still
possible for them to overwhelm carbon: they'll only drain from the cache
at a slow rate of 60 metrics/minute (or whatever you set), which can
cause the same problem over a longer period of time. This is a problem I
solved fairly recently, so if you're running a version of carbon older
than, say, a month or two, you probably don't have the fix applied.

In summary, I suggest adjusting your carbon.conf settings as I've
described and also updating to the latest version of carbon.
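
For reference, using the numbers I mentioned above, the relevant
carbon.conf lines would look something like this (tune the values to
your own memory, disks, and update rate):

MAX_CACHE_SIZE = 10000000
MAX_UPDATES_PER_SECOND = 800
MAX_CREATES_PER_MINUTE = 60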
