
graphite-dev team mailing list archive

[Question #229112]: Graphite is not stable / losing connection to caches

 

New question #229112 on Graphite:
https://answers.launchpad.net/graphite/+question/229112

Hello!
We have a fairly powerful server (8 cores, 300 GB RAM, fast storage) for Graphite, running one relay and 4 cache instances with consistent hashing; no aggregators are used. Graphite-web with Apache also runs on the same server. We currently receive about 850K metrics/min and 20-30K cache queries/min, according to the stats. Most of the time the server works fine, but recently we have started having a problem.
It looks like one or two cache instances suddenly stop working, and we get gaps in the graphs.
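For a sense of scale, the per-minute figures above can be converted to per-second rates. This is only a rough sketch: it assumes consistent hashing spreads metrics evenly across the 4 cache instances, which may not hold in practice, and the variable names are made up for illustration.

```python
# Figures from the description above; an even spread across caches is an assumption.
METRICS_PER_MIN = 850_000
CACHE_INSTANCES = 4
CACHE_QUERIES_PER_MIN = (20_000, 30_000)  # reported range


def per_second(per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60.0


total_per_sec = per_second(METRICS_PER_MIN)
per_cache_per_sec = total_per_sec / CACHE_INSTANCES
queries_per_sec = tuple(per_second(q) for q in CACHE_QUERIES_PER_MIN)

print(f"datapoints/s, whole server: {total_per_sec:.0f}")
print(f"datapoints/s, per cache (if evenly spread): {per_cache_per_sec:.0f}")
print(f"cache queries/s: {queries_per_sec[0]:.0f} to {queries_per_sec[1]:.0f}")
```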
In the relay log we see:
=========================================
17/05/2013 14:41:44 :: [console] Starting factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)
17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade
17/05/2013 14:41:44 :: [listener] MetricLineReceiver connection with 10.32.232.11:47637 established
17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.
17/05/2013 14:41:44 :: [console] <twisted.internet.tcp.Connector instance at 0x2fafab8> will retry in 5 seconds
17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.
17/05/2013 14:41:44 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
=========================================
This repeats continuously.
Cache log file:
=========================================
17/05/2013 14:26:39 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:26:58 :: [console] Starting factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:26:58 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)
17/05/2013 14:27:32 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.
17/05/2013 14:27:32 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.
17/05/2013 14:27:32 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
=========================================
I.e. the cache instance keeps reconnecting to the relay, but for some reason without success. The error logs are empty.
Restarting the cache instance did not help. Things return to normal only after restarting the relay, but the problem recurs after 5-10 hours.
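To see which destination is flapping and how often, the `connectionLost` lines could be counted per destination. The following is a minimal sketch that assumes the log line format shown in the excerpts above; the function name and the sample list are hypothetical.

```python
import re
from collections import Counter

# Matches lines in the format seen in the excerpts above, e.g.
# "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost ..."
LOST = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) :: \[clients\] "
    r"CarbonClientProtocol\((?P<dest>[^)]+)\)::connectionLost"
)


def count_lost_connections(lines):
    """Return a Counter mapping destination -> number of connectionLost events."""
    counts = Counter()
    for line in lines:
        match = LOST.match(line)
        if match:
            counts[match.group("dest")] += 1
    return counts


# Sample lines copied from the log excerpts above.
sample = [
    "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade",
    "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.",
    "17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.",
    "17/05/2013 14:27:32 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.",
]

print(count_lost_connections(sample))
```

In practice the function could be fed an open log file instead of the sample list, since it only iterates over lines.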

Maybe we have performance problems, but the system looks quite idle:
=========================================
top - 15:12:44 up 22:32,  9 users,  load average: 6.58, 7.15, 7.01
Tasks: 269 total,   3 running, 266 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.6%us,  0.8%sy,  0.0%ni, 69.6%id, 11.9%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  297175992k total, 39414336k used, 257761656k free,   263712k buffers
Swap:  2568188k total,        0k used,  2568188k free, 30566700k cached
=========================================
Another server with a similar configuration, but with 24 GB RAM and slower storage, handles 400K metrics/min without any problems.

Configs are below:
carbon.conf
=========================================
[cache]
LOG_DIR = /opt/graphite/log
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 5000
LINE_RECEIVER_INTERFACE = 0.0.0.0
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = True
WHISPER_LOCK_WRITES = True
USE_WHITELIST = True
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202
[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302
[cache:d]
LINE_RECEIVER_PORT = 2403
PICKLE_RECEIVER_PORT = 2404
CACHE_QUERY_PORT = 7402
[relay]
USER =
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 500000
USE_FLOW_CONTROL = True
[aggregator]
USER =
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
DESTINATIONS = 127.0.0.1:2104:a
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 200000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5
USE_WHITELIST = True
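
One thing that can be checked mechanically is whether every entry in the relay's DESTINATIONS points at the PICKLE_RECEIVER_PORT of the matching `[cache:X]` section. The sketch below does that with Python's standard `configparser` on a trimmed copy of the config above. It is an illustration only, not how carbon itself reads the file, and the function name is hypothetical.

```python
import configparser

# Trimmed copy of the relevant settings from carbon.conf above.
CARBON_CONF = """
[cache:a]
PICKLE_RECEIVER_PORT = 2104
[cache:b]
PICKLE_RECEIVER_PORT = 2204
[cache:c]
PICKLE_RECEIVER_PORT = 2304
[cache:d]
PICKLE_RECEIVER_PORT = 2404
[relay]
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d
"""


def mismatched_destinations(conf_text):
    """Return (destination, reason) pairs for relay destinations that do not
    match a cache section's pickle receiver port."""
    # Only "=" is treated as a delimiter because values contain ":".
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(conf_text)
    problems = []
    for raw in parser["relay"]["DESTINATIONS"].split(","):
        dest = raw.strip()
        host, port, instance = dest.split(":")
        section = f"cache:{instance}"
        if not parser.has_section(section):
            problems.append((dest, "no matching cache section"))
        elif parser[section].get("PICKLE_RECEIVER_PORT") != port:
            problems.append((dest, "port differs from PICKLE_RECEIVER_PORT"))
    return problems


print(mismatched_destinations(CARBON_CONF))
```

For the config as posted, the ports and instance names line up, so the check returns an empty list.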

storage-schemas.conf
=========================================
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1_min_30_days_15_min_1_year_1hour_5years_24hours_10years]
priority = 100
pattern = .*
retentions = 60:30d,900:1y,3600:5y,90000:10y
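
For reference, the default retention string can be expanded into seconds per point and number of points per archive. This sketch assumes a day is 86400 seconds and a year is 365 days; the helper name is hypothetical. Note that 90000 seconds is 25 hours, while the section name mentions 24 hours (86400 seconds).

```python
# Assumed unit lengths in seconds (a year is taken as 365 days).
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,
}


def parse_retentions(spec):
    """Turn 'secondsPerPoint:duration,...' into (seconds_per_point, points) tuples."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.strip().split(":")
        seconds_per_point = int(precision)
        span_seconds = int(duration[:-1]) * UNIT_SECONDS[duration[-1]]
        archives.append((seconds_per_point, span_seconds // seconds_per_point))
    return archives


archives = parse_retentions("60:30d,900:1y,3600:5y,90000:10y")
for seconds_per_point, points in archives:
    print(f"{seconds_per_point:>6} s/point -> {points} points")
print("total points per metric:", sum(points for _, points in archives))
```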

blacklist.conf
=========================================
.*5MinuteRate
.*75percentile
.*98percentile
.*99percentile
.*999percentile

-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.