Re: [Question #285063]: Twisted MemoryError / MetricCache is full

Question #285063 on Graphite changed:
https://answers.launchpad.net/graphite/+question/285063

Description changed to:
Hello, and a pre-emptive "thank you" for your time.

I have a beastly server for Graphite:

CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
RAM: 128GB
Disks:
- /dev/sda running the OS, dual SSDs in RAID1.
- /dev/sdb holding the Whisper DBs, quad SSDs in 2TB RAID1+0, mounted at /opt/graphite/storage.

Relevant package versions:
carbon==0.9.15
ceres==0.10.0
Django==1.4.14
Twisted==15.5.0
whisper==0.9.14

I'm tuning this box for maximum I/O. I have 8 carbon-cache instances
running, load balanced by a local instance of HAProxy on TCP 2003 (I ran
into major performance issues with carbon-relay, carbon-relay-ng, and
other alternatives). All block devices are using the noop disk
scheduler. I have collectd deployed to a few hundred machines and have
about 8MM metrics per second incoming, all of which are actually being
committed (per the graphed pointsCommitted values).
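
Here's a trimmed-down sketch of the kind of haproxy.cfg I mean (the
ports are illustrative; one backend "server" line per carbon-cache
instance):

=====

listen carbon
    bind 0.0.0.0:2003
    mode tcp
    balance roundrobin
    server cache-a 127.0.0.1:2103 check
    server cache-b 127.0.0.1:2203 check
    # ...one "server" line per instance, eight in total

=====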

However, I'm experiencing issues similar to this thread:
https://answers.launchpad.net/graphite/+question/186672

To summarize the problem: the cache of one of the instances (not always
the same instance) will steadily grow over the course of 15-20 minutes
until it reaches MAX_CACHE_SIZE, at which point holes start appearing in
the data for whichever metrics that carbon-cache instance is handling.
If I kill -9 the process and restart it, the cache size drops to zero,
all the missing data magically appears, and life goes on for another
15-20 minutes. I expect this is normal high-watermark handling; however,
I'm unable to determine why a random carbon-cache process gets "stuck"
and its cache starts piling up in the first place.
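
(For anyone reproducing this: carbon's own self-instrumentation is the
easiest way to watch the ramp. A throwaway script along these lines is
enough -- the metric path is an assumption, so adjust the
hostname/instance portion for your box:)

=====

# Sketch: poll one instance's self-reported cache size from its whisper file.
# Path is hypothetical; carbon writes its own stats under
# carbon.agents.<hostname with dots as underscores>-<instance>.
import time
import whisper

CACHE_SIZE_WSP = "/opt/graphite/storage/whisper/carbon/agents/HOSTNAME-a/cache/size.wsp"

while True:
    (start, end, step), values = whisper.fetch(CACHE_SIZE_WSP, int(time.time()) - 600)
    recent = [v for v in values if v is not None]
    if recent:
        print("cache.size over last 10m: min=%d max=%d" % (min(recent), max(recent)))
    time.sleep(60)

=====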

Here's some relevant output from the console.log for one of the carbon-
cache instances:

=====

11/02/2016 01:02:37 :: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/writer.py", line 158, in writeForever
    writeCachedDataPoints()
  File "/opt/graphite/lib/carbon/writer.py", line 146, in writeCachedDataPoints
    instrumentation.append('updateTimes', updateTime)
  File "/opt/graphite/lib/carbon/instrumentation.py", line 41, in append
    stats[stat].append(value)
exceptions.MemoryError: 

11/02/2016 01:03:19 :: MetricCache is full: self.size=15000000
11/02/2016 01:03:19 :: Exception in cacheFull event handler: args=() kwargs={}
Traceback (most recent call last):
  File "/opt/graphite/lib/carbon/protocols.py", line 78, in lineReceived
    self.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/protocols.py", line 64, in metricReceived
    events.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/events.py", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/cache.py", line 48, in store
    events.cacheFull()
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/events.py", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/events.py", line 32, in <lambda>
    cacheFull.addHandler(lambda: carbon.instrumentation.increment('cache.overflow'))
exceptions.NameError: global name 'carbon' is not defined

=====
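
The second traceback looks like a separate, mostly cosmetic bug: the
cacheFull handler registered in carbon/events.py references the name
"carbon", which apparently isn't bound in that module, so the
cache.overflow counter never gets incremented. A local workaround I've
sketched (not an official fix, and it assumes the missing import is the
only problem) is to swap the lambda for a function with a deferred
import:

=====

# carbon/events.py -- sketch of a local workaround, not the upstream fix.
def _recordCacheOverflow():
    # Deferred import sidesteps whatever import-order issue left the
    # module-level name unbound.
    from carbon import instrumentation
    instrumentation.increment('cache.overflow')

cacheFull.addHandler(_recordCacheOverflow)

=====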

Relevant carbon.conf config excerpt:

=====

[cache]

ENABLE_LOGROTATION = True

USER = root

MAX_CACHE_SIZE = 15000000
MAX_UPDATES_PER_SECOND = 15000000 (yes really)
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = inf
MAX_CREATES_PER_MINUTE = inf

LOG_LISTENER_CONNECTIONS = False

USE_FLOW_CONTROL = True

LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False

CACHE_WRITE_STRATEGY = naive

WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = True

=====
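
Each of the eight instances then gets its own receiver ports via
per-instance sections; the values below are illustrative rather than my
exact ports:

=====

[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202

# ...and so on through [cache:h]

=====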

Relevant storage-schemas.conf:

=====

[default]
pattern = .*
retentions = 1m:1d,5m:10d,30m:15d,60m:365d 

=====
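
For what it's worth, that schema works out to roughly 160KiB of disk per
metric, which is easy to double-check with a throwaway whisper file
(sketch, using the whisper module already on the box):

=====

# Sketch: measure the on-disk footprint of one metric under the schema above.
import os
import whisper

retentions = "1m:1d,5m:10d,30m:15d,60m:365d"
archives = [whisper.parseRetentionDef(r) for r in retentions.split(",")]
whisper.create("/tmp/retention-test.wsp", archives,
               xFilesFactor=0, aggregationMethod="average")
print(os.path.getsize("/tmp/retention-test.wsp"))  # prints the per-metric size in bytes
os.remove("/tmp/retention-test.wsp")

=====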

Relevant storage-aggregation.conf:

=====

[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average

=====

Graphite-Web configuration is out-of-the-box. Relevant httpd.conf:

=====

<IfModule mpm_worker_module>
MaxClients 128
ServerLimit 128
StartServers 10
MaxConnectionsPerChild 10240
</IfModule>

=====

Relevant output of strace -f:

=====

[pid 22915] lseek(17, -1340, SEEK_CUR)  = 47812
[pid 22915] write(17, "V\274\234\24A^\17\347\232\366\35p", 12) = 12
[pid 22915] lseek(17, 16384, SEEK_SET)  = 16384
[pid 22915] read(17, "V\273u\264A\20G\332z \24rV\273u\360A\214\265\34L\211\355\300V\273v,A\22\267}"..., 4096) = 4096
[pid 22915] futex(0x1400d90, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 22887] futex(0x1400d90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 22915] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22887] <... futex resumed> )       = 0

=====
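
(Side note on the strace: the 12-byte write there is a single datapoint.
Whisper packs each point as a big-endian 4-byte timestamp plus an 8-byte
double, so lots of tiny seek-and-write operations are expected:)

=====

# Each whisper datapoint on disk is 12 bytes: "!Ld" = uint32 timestamp + float64 value.
import struct
point = struct.pack("!Ld", 1455062557, 42.0)
assert len(point) == 12

=====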

I've tried various values for MAX_UPDATES_PER_SECOND between 10000 and
inf and have noticed that, in my configuration, the higher the value,
the less often I see caches getting bloated -- but it still happens.
When the value is under roughly 50000, values arrive late to the system,
sometimes 10-15 minutes late. When the value is set to a ridiculous
number, all metrics arrive in a timely fashion, but we're still subject
to clobbering by carbon-cache instances getting stuck in cache mode.

At one point I noticed that the kernel scheduler was plopping one pair
of carbon-cache PIDs on a different NUMA node than the other sets, and
tried to see if numactl would make any difference. It didn't.
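
For concreteness, pinning an instance looked roughly like the line below
(the node number, instance letter, and carbon-cache.py path are
illustrative):

=====

numactl --cpunodebind=0 --membind=0 /opt/graphite/bin/carbon-cache.py --instance=a start

=====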

Metrics Queued usually hovers around 100k per instance. RAM usage is
around 2.5GB per instance and stable. I've ruled out network I/O as a
bottleneck (~200 Mbps on a 1 Gbps link). I can't get a good read on disk
bus saturation since util% in iostat is inaccurate for RAID arrays. I'd
blame memory, but even right now there is 18GB of completely unallocated
RAM on the box.

I'm pretty much at a loss as to what to investigate at this point and am
about to dive into the source code of the writer to see if I can find
any leads. If I could get this implementation to the point where the
cache size issue wasn't constantly clobbering my ability to promote it
to Production, I would be a very happy engineer. Any assistance provided
would be greatly appreciated. Let me know if there are any further
questions and I'll respond as quickly as I'm able.
