
[Question #285063]: Twisted MemoryError / MetricCache is full


New question #285063 on Graphite:
https://answers.launchpad.net/graphite/+question/285063

Hello, and a pre-emptive "thank you" for your time.

I have a beastly server for Graphite:

CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
RAM: 128GB
Disks:
- /dev/sda running the OS, dual SSDs in RAID0.
- /dev/sdb holding the Whisper DBs, quad SSDs in 2TB RAID1+0, mounted at /opt/graphite/storage.

Relevant package versions:
carbon==0.9.15
ceres==0.10.0
Django==1.4.14
Twisted==15.5.0
whisper==0.9.14

I'm tuning this box for maximum I/O throughput. I have 8 carbon-cache instances running, load balanced by a local HAProxy instance listening on TCP 2003 (I ran into major performance issues with carbon-relay, carbon-relay-ng, and other alternatives). All block devices use the noop I/O scheduler. I have collectd deployed to a few hundred machines, giving about 8MM metrics per second incoming, all of which are actually being committed (per the pointsCommitted values I graph).
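
For reference, the HAProxy piece is just plain TCP round-robin across the instances' line receivers; the ports below are illustrative and need to match each instance's LINE_RECEIVER_PORT:

=====

frontend carbon_in
    bind *:2003
    mode tcp
    default_backend carbon_caches

backend carbon_caches
    mode tcp
    balance roundrobin
    server cache1 127.0.0.1:2103 check
    server cache2 127.0.0.1:2203 check
    # ...cache3 through cache7 omitted here...
    server cache8 127.0.0.1:2803 check

=====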

However, I'm experiencing issues similar to this thread: https://answers.launchpad.net/graphite/+question/186672

To summarize the problem: the cache of one of the instances (not always the same one) steadily grows over the course of 15-20 minutes until it reaches MAX_CACHE_SIZE, at which point holes start appearing in the data for whichever metrics that carbon-cache instance is handling. If I kill -9 the process and restart it, the cache size drops to zero, all the missing data magically appears, and life goes on for another 15-20 minutes. I expect this is normal high-watermark handling; what I can't determine is why a random carbon-cache process gets "stuck" and its cache starts piling up in the first place.
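
(Back of the envelope: filling a 15,000,000-point cache over 15-20 minutes works out to roughly 15,000,000 / 900-1,200s ≈ 12k-17k datapoints/sec of net backlog growth on that instance, which suggests the writer is falling slightly behind rather than stopping dead -- right up until it hits the ceiling.)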

Here's some relevant output from the console.log for one of the carbon-cache instances:

=====

11/02/2016 01:02:37 :: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/writer.py", line 158, in writeForever
    writeCachedDataPoints()
  File "/opt/graphite/lib/carbon/writer.py", line 146, in writeCachedDataPoints
    instrumentation.append('updateTimes', updateTime)
  File "/opt/graphite/lib/carbon/instrumentation.py", line 41, in append
    stats[stat].append(value)
exceptions.MemoryError: 

11/02/2016 01:03:19 :: MetricCache is full: self.size=15000000
11/02/2016 01:03:19 :: Exception in cacheFull event handler: args=() kwargs={}
Traceback (most recent call last):
  File "/opt/graphite/lib/carbon/protocols.py", line 78, in lineReceived
    self.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/protocols.py", line 64, in metricReceived
    events.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/events.py", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/cache.py", line 48, in store
    events.cacheFull()
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/events.py", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/events.py", line 32, in <lambda>
    cacheFull.addHandler(lambda: carbon.instrumentation.increment('cache.overflow'))
exceptions.NameError: global name 'carbon' is not defined

=====
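
As a side note, the NameError at the bottom looks like the cacheFull handler in carbon/events.py refers to carbon.instrumentation without the name "carbon" being defined in that module, so the cache.overflow counter presumably never gets incremented. Purely as a local band-aid (a guess based only on the traceback above, not a vetted patch), replacing the lambda with a handler that imports the module lazily seems like it should work:

=====

def record_cache_overflow():
    # Import lazily to sidestep circular-import surprises at module load time.
    from carbon import instrumentation
    instrumentation.increment('cache.overflow')

cacheFull.addHandler(record_cache_overflow)

=====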

Relevant carbon.conf config excerpt:

=====

[cache]

ENABLE_LOGROTATION = True

USER = root

MAX_CACHE_SIZE = 15000000
MAX_UPDATES_PER_SECOND = 15000000 (yes really)
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = inf
MAX_CREATES_PER_MINUTE = inf

LOG_LISTENER_CONNECTIONS = False

USE_FLOW_CONTROL = True

LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False

CACHE_WRITE_STRATEGY = naive

WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = True

=====
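
For completeness, each of the 8 cache instances gets its own [cache:N] section with distinct ports, along these lines (port numbers are illustrative and match the HAProxy sketch above):

=====

[cache:1]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

[cache:2]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202

# ...and so on through [cache:8]

=====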

Relevant storage-schema.conf:

=====

[default]
pattern = .*
retentions = 1m:1d,5m:10d,30m:15d,60m:365d 

=====
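
In case per-file I/O volume matters here, my back-of-the-envelope sizing for this schema (whisper stores 12 bytes per datapoint, plus a small header) is roughly:

=====

# Points per archive: 1m:1d + 5m:10d + 30m:15d + 60m:365d
points = 1440 + 2880 + 720 + 8760      # = 13800 points per metric
size_bytes = points * 12               # ~165,600 bytes (~162 KiB) per .wsp file, plus header

=====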

Relevant storage-aggregation.conf:

=====

[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average

=====

The Graphite-web configuration is stock (out of the box). Relevant httpd.conf excerpt:

=====

<IfModule mpm_worker_module>
MaxClients 128
ServerLimit 128
StartServers 10
MaxConnectionsPerChild 10240
</IfModule>

=====

Relevant output of strace -f:

=====

[pid 22915] lseek(17, -1340, SEEK_CUR)  = 47812
[pid 22915] write(17, "V\274\234\24A^\17\347\232\366\35p", 12) = 12
[pid 22915] lseek(17, 16384, SEEK_SET)  = 16384
[pid 22915] read(17, "V\273u\264A\20G\332z \24rV\273u\360A\214\265\34L\211\355\300V\273v,A\22\267}"..., 4096) = 4096
[pid 22915] futex(0x1400d90, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 22887] futex(0x1400d90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 22915] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22887] <... futex resumed> )       = 0

=====
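
If I'm reading that correctly, the 12-byte write()s line up exactly with whisper's on-disk point format (a 4-byte timestamp plus an 8-byte double), so the writer thread does appear to be committing real datapoints rather than spinning:

=====

import struct

# Whisper packs each datapoint as (timestamp, value) with the '!Ld' struct format:
# 4-byte unsigned int + 8-byte double = 12 bytes, matching the write(17, ..., 12)
# calls in the strace output above.
print struct.calcsize('!Ld')  # 12

=====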

I've tried various values for MAX_UPDATES_PER_SECOND between 10000 and inf and have noticed that, in my configuration, the higher the value, the less often I see caches getting bloated -- but it still happens. When the value is below roughly 50000, datapoints arrive late to the system, sometimes 10-15 minutes late. When the value is set ridiculously high, all metrics arrive in a timely fashion, but we're still subject to clobbering whenever a carbon-cache instance gets stuck in cache mode.

At one point I noticed that the kernel scheduler was placing one pair of carbon-cache PIDs on a different NUMA node than the others, and tried numactl to see if it would make any difference. It didn't.

Metrics Queued usually hovers around 100k per instance. RAM usage is around 2.5GB per instance and stable. I've ruled out network I/O as a bottleneck (~200Mbps on a 1Gbps link). I can't get a good read on disk bus saturation, since util% in iostat is inaccurate for RAID devices. I'd blame memory, but even right now there's 18GB of completely unallocated free RAM on the box.
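
For what it's worth, this is roughly how I'm watching the per-instance cache numbers (the host and the carbon.agents.* naming are specific to my setup and may differ elsewhere):

=====

import json
import urllib2

# Pull the last 30 minutes of per-instance cache size from graphite-web's render API.
URL = ('http://localhost/render?target=carbon.agents.*.cache.size'
       '&from=-30min&format=json')

for series in json.load(urllib2.urlopen(URL)):
    # Print the most recent non-null cache size for each carbon-cache instance.
    values = [v for v, t in series['datapoints'] if v is not None]
    print series['target'], values[-1] if values else 'n/a'

=====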

I'm pretty much at a loss as to what to investigate next and am about to dive into the writer's source code to see if I can find any leads. If I could get this implementation to the point where the cache-size issue wasn't constantly clobbering my ability to promote it to production, I would be a very happy engineer. Any assistance would be greatly appreciated. Let me know if there are any further questions and I'll respond as quickly as I'm able.
