
graphite-dev team mailing list archive

Re: [Question #244564]: carbon-cache/whisper stopped writing wsp files over NFS share

 

Question #244564 on Graphite changed:
https://answers.launchpad.net/graphite/+question/244564

Travis Groth posted a new comment:
That’s not a very helpful answer or comment, Jason.  I was hoping for
guidance on how to debug the issue, not from-the-hip speculation about
configuration details that should not matter.

- Network storage shouldn’t break carbon as long as it can keep up (and
in any case we’re not using it)

- Setting a maximum cache size shouldn’t be a “bad” idea.  Bounding
memory is a good thing.  We have configured our system both with and
without a limit.  In this case, any non-inf value appears broken due to
a very long-standing, unfixed bug:
https://github.com/graphite-project/carbon/issues/167.  If I turn on a
cache size limit now, carbon becomes unresponsive almost immediately.
So, granted, non-inf is a bad idea, but only because cache sizing is
currently broken in 0.9.12.

- Running in a VM has no impact on carbon’s stability as long as the IO
keeps up, and in our case it does: our IO wait was under 20% at any
given time.
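For reference, the knob in question is MAX_CACHE_SIZE in carbon.conf.  A
minimal sketch of the two configurations we tried (the numeric value is
illustrative, not a recommendation):

```ini
# carbon.conf — [cache] section (0.9.x)
[cache]
# Unbounded cache: memory grows until the OOM killer steps in.
MAX_CACHE_SIZE = inf

# Bounded cache: in 0.9.12 this is where issue #167 bites and carbon
# becomes unresponsive almost immediately.
#MAX_CACHE_SIZE = 2000000
```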

I have taken straces during the issue (and will do so again if needed),
but they revealed nothing meaningful to me or the team here; it was
mostly futex calls.

That said, we've done further research and the "bug" isn't so much a bug
as unintuitive behavior when carbon-cache is CPU bound.  It accepts and
caches metrics immediately (seemingly as its highest-priority task), but
since a carbon instance is effectively limited to roughly one CPU, it
may not have enough processing time left to actually flush the cache on
the back end.  This leads to a very dramatic tipping point in CPU usage
that is hard to observe from the OS: if that "100%" CPU isn't balanced
between accepting metrics and actually writing them out, you fall
further and further behind and your memory footprint grows without
bound.
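The tipping point above can be sketched with a toy arithmetic model (my
own illustration, not carbon code): once the CPU share left for flushing
yields a write rate below the ingest rate, the backlog grows linearly
and never drains.

```python
# Toy model: with one CPU's worth of work split between accepting
# points and flushing them, backlog growth is just the rate gap.
def cache_growth(ingest_rate, flush_rate, seconds):
    """Points left in the cache after `seconds`, never negative."""
    backlog = 0
    for _ in range(seconds):
        backlog = max(0, backlog + ingest_rate - flush_rate)
    return backlog

# 100k points/s in, only 80k points/s flushed: after an hour the cache
# holds 72 million points and is still growing.
print(cache_growth(100_000, 80_000, 3600))  # 72000000
```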

This turns into a failure in two ways: 1) as the cache grows, carbon
slows down until it is unusable and/or gets OOM-killed; 2) beyond some
cache size it never actually flushes to disk, even when you stop the
daemon with no MAX_UPDATES_PER_SECOND_ON_SHUTDOWN limit.  We might have
5 GB of metrics cached (say, two hours' worth), and when we stop the
process we lose most of that two hours of data.
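For context, the shutdown flush rate is governed by the
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN setting in carbon.conf.  A sketch of
leaving it uncapped (assuming the 0.9.x setting accepts inf the way
MAX_UPDATES_PER_SECOND does); even so, at a multi-gigabyte cache size
the flush never completed for us:

```ini
# carbon.conf — [cache] section (0.9.x); inf removes the shutdown cap
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = inf
```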

So we wound up configuring one carbon-cache instance per vCPU and
putting carbon-relay in front of all of them with consistent hashing.
We now appear to be keeping up, though carbon-relay itself is now
approaching being CPU bound; once that happens, I imagine our
configuration will get even more complicated.  It would be very nice if
carbon could scale to multiple CPUs on its own, without the
administrator orchestrating haproxy, multiple layers of processes, port
differentiation, hashing, and so on.
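A sketch of the layout we ended up with, assuming a 2-vCPU box (instance
names and port numbers are illustrative; the section and option names
follow the 0.9.x carbon.conf conventions):

```ini
# carbon.conf — one cache instance per vCPU, addressed as [cache:NAME]
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7102

# carbon-relay in front, hashing metrics across the instances
[relay]
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
```

Each instance is then started separately (e.g. `carbon-cache.py
--instance=a start`), and graphite-web needs the per-instance cache
query ports listed in CARBONLINK_HOSTS so it can find cached points.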

-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.