
graphite-dev team mailing list archive

Re: [Question #244564]: carbon-cache/whisper stopped writing wsp files over NFS share

 

Question #244564 on Graphite changed:
https://answers.launchpad.net/graphite/+question/244564

    Status: Expired => Open

D DC is still having a problem:
Hello,

I have two Linux servers for a collectd/carbon-cache/graphite solution:
The 1st server (server1) runs collectd-receiver and carbon-cache. Whisper files are written over an NFS share. It receives metrics from about 40 agents (Linux servers).
The 2nd server (server2) runs the graphite/apache web frontend. It reads the data from the same NFS share as server1.

This config ran for 6 months without problems. At some point we had an
NFS problem and the NFS share became inaccessible. Data stopped being
written to disk at that point, and the carbon-cache process crashed
after a few days.

After NFS was restored, server1 was rebooted a few days later. server2
was rebooted as well.

But a few days later carbon-cache crashed again. Over the last week it has crashed after 1-2 days of running, and it no longer writes to the wsp files. It crashes when memory fills up, because all the metrics are cached in memory. It seems it is no longer able to flush to disk (over NFS). There were no configuration changes whatsoever.
I checked on server1 using "lsof" to see which files are open. Immediately after a reboot, carbon-cache updates many files for the first few minutes, but then it stops writing to files and starts caching in memory again. After those initial few minutes following the carbon-cache restart, there is always one file that remains open (not always the same file). Currently this file has been open for the last 3 days:

lsof  /cap/
COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF       NODE NAME
carbon-ca 1969 apache   13u   REG   0,20 10017256 4517756010 /cap/metrics/server/sli2268_sli_bz/df-mapper_vgos_sli2268-lv_opt/df_inodes-free.wsp (nfssrv:/ifs/data/CAP)
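
As a complement to the lsof check above, the wsp files that have stopped receiving updates could be listed by modification time. The following is only a minimal sketch: the function name find_stale_wsp and the 600-second threshold in the usage note are illustrative assumptions, not part of carbon or whisper. Keep in mind that stat calls can block on an unhealthy NFS mount and that NFS attribute caching can make mtimes look slightly older than they are.

```python
import os
import time


def find_stale_wsp(root, max_age_seconds, now=None):
    """Return (path, age_seconds) pairs for .wsp files not modified recently.

    Hypothetical helper for illustration; not part of carbon or whisper.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            try:
                age = now - os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if age > max_age_seconds:
                stale.append((path, age))
    # Files that have gone longest without an update come first.
    stale.sort(key=lambda item: item[1], reverse=True)
    return stale
```

For example, find_stale_wsp("/cap/metrics", 600) would return every wsp file under the share that has not been modified in the last 10 minutes. Running it a few minutes after a restart and again later would show whether the set of stale files keeps growing.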

I already checked/tried:
- Rebooted both servers a few times.
- Validated all the wsp files using "/opt/SLI/bin/whisper-info.py"; they are all fine, not corrupted.
- Validated that NFS is working fine.
- I have not added any new servers to monitor since the problem started a few weeks ago, so the metrics received and the load are the same as they were before the external NFS/network issue.
- Tried "MAX_CACHE_SIZE = 350000" and/or "MAX_UPDATES_PER_SECOND = 5000" in the carbon.conf file to force data to be flushed to disk faster, but it still does not write to all files. Originally MAX_CACHE_SIZE = inf (and it worked fine for 6 months like that, with the same load!)

Questions:
1. How can I enable DEBUG mode for carbon-cache so I can see which files it opens for writing and when it closes them? I already enabled "LOG_UPDATES = True" in carbon.conf, but it does not log much information.
The content of /opt/SLI/graphite/storage/log/carbon-cache/carbon-cache-a/console.log is:

21/02/2014 14:30:03 :: Log opened.
21/02/2014 14:30:03 :: twistd 13.0.0 (/opt/SLI/bin/python 2.7.4) starting up.
21/02/2014 14:30:03 :: reactor class: twisted.internet.epollreactor.EPollReactor.
21/02/2014 14:30:03 :: ServerFactory starting on 2003
21/02/2014 14:30:03 :: Starting factory <twisted.internet.protocol.ServerFactory instance at 0x1b90e18>
21/02/2014 14:30:03 :: ServerFactory starting on 2004
21/02/2014 14:30:03 :: Starting factory <twisted.internet.protocol.ServerFactory instance at 0x1b90488>
21/02/2014 14:30:03 :: ServerFactory starting on 7002
21/02/2014 14:30:03 :: Starting factory <twisted.internet.protocol.ServerFactory instance at 0x1b84488>
21/02/2014 14:30:03 :: set uid/gid 48/48
21/02/2014 14:30:06 :: Sorted 881 cache queues in 0.000547 seconds
21/02/2014 14:30:36 :: Sorted 14681 cache queues in 0.008075 seconds

--no more info in the log file after this for days--


2. Is there a way to find out why the process caches so much data in memory without writing datapoints to files on disk?

3. Why does carbon-cache seem to write many files quickly just after it
is restarted, and then after a few minutes keep one file open and hang
without writing to any other files? Memory continues to grow until it is
full and the process crashes.
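
One check that might help narrow down questions 2 and 3 is whether an exclusive lock can still be taken on the file that lsof shows as stuck. This is a sketch under an unconfirmed assumption, namely that the hanging write is waiting on an flock-style file lock; nothing in the logs above establishes that. The helper name can_lock is hypothetical. The probe uses a non-blocking lock request, although on an unhealthy NFS mount even the open call may hang.

```python
import fcntl


def can_lock(path):
    """Try a non-blocking exclusive flock on path.

    Returns True if the lock was acquired (and released again),
    False if it could not be taken. Hypothetical helper for illustration.
    """
    with open(path, "r+b") as fh:
        try:
            fcntl.flock(fh.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError):
            return False  # lock is held elsewhere, or locking failed
        fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
        return True
```

A False result for the stuck file while carbon-cache is stopped would suggest that a lock is still held somewhere outside the local process, for instance a stale lock left on the NFS server side. A True result would point away from locking as the cause.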

If anybody has any idea or needs more info please let me know and I will
provide it.

-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.