graphite-dev team mailing list archive

[Question #660458]: Missing Metrics in Carbon Cache

 

New question #660458 on Graphite:
https://answers.launchpad.net/graphite/+question/660458

When we query through graphite-web, some datapoints that should come from the carbon cache are returned as null, while other datapoints are retrieved from the cache and displayed correctly. Once everything has been flushed to disk, all of the datapoints appear. Both members of the cluster exhibit the same behavior.

We have a new Graphite cluster of two servers. Each server runs carbon-relay, one carbon-cache instance, and graphite-web, which sits behind a load balancer. We are using consistent-hashing with a replication factor of 2 so that both servers have the same data.
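
To illustrate why a replication factor of 2 across two destinations should leave both servers with the same data, here is a minimal sketch of a consistent-hash ring. It is a simplified stand-in, not carbon's actual implementation: the `HashRing` class, its `replicas_per_node` parameter, and the hashing details are assumptions made for illustration. Only the two destination addresses and the metric name come from the configuration and logs in this question.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring (illustrative only, not carbon's code)."""

    def __init__(self, nodes, replicas_per_node=100):
        # Each node is placed on the ring many times to even out the distribution.
        self.ring = []
        for node in nodes:
            for i in range(replicas_per_node):
                self.ring.append((self._position(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _position(key):
        # Map a string to an integer position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_nodes(self, metric, replication_factor):
        """Walk clockwise from the metric's position, collecting distinct nodes."""
        start = bisect.bisect_left(self.ring, (self._position(metric), ""))
        chosen = []
        for offset in range(len(self.ring)):
            _, node = self.ring[(start + offset) % len(self.ring)]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replication_factor:
                break
        return chosen


ring = HashRing(["127.0.0.1:2005", "10.1.1.12:2005"])
metric = "web.servers.web.HOST1.perf.processor.pct_processor_time"
print(sorted(ring.get_nodes(metric, replication_factor=2)))
# ['10.1.1.12:2005', '127.0.0.1:2005']
```

With only two destinations, asking for two distinct nodes always returns both of them, whatever the metric name hashes to. With a replication factor of 1, each metric would go to exactly one of the two.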

The carbon-cache size settles between 100k and 300k metrics; the fewer metrics in the cache, the fewer are missing. Restarting carbon-cache masks the issue for a few hours while the cache grows again. We are writing ~275k metrics per minute.

The server referenced below sends metrics every 5 seconds, matching what is set in storage-schemas.conf. We have other servers that send metrics every 30 seconds, and the behavior is similar.

Below are some of the configs and logs illustrating the issue. If any other info is needed, please let me know. Any assistance is greatly appreciated.

tail -f query.log
07/11/2017 12:40:43 :: [127.0.0.1:36494] cache query for "web.servers.web.HOST1.perf.processor.pct_processor_time" returned 7 values

In the output below, the oldest ~16 datapoints have been persisted to disk and all show up correctly in graphite-web. Next come ~21 null datapoints that are not being pulled from the cache, but which I presume are there since they get written to disk a few moments later. Finally, the newest 7 datapoints are pulled from the cache and shown correctly in graphite-web.

curl "http://SERVER1/render/?target=web.servers.web.HOST1.perf.processor.pct_processor_time&format=json"
[7.0, 1510079840], [6.0, 1510079845], [4.0, 1510079850], [5.0, 1510079855], [5.0, 1510079860], [8.0, 1510079865], [5.0, 1510079870], [6.0, 1510079875], [9.0, 1510079880], [5.0, 1510079885], [4.0, 1510079890], [3.0, 1510079895], [4.0, 1510079900], [3.0, 1510079905], [null, 1510079910], [null, 1510079915], [null, 1510079920], [null, 1510079925], [null, 1510079930], [null, 1510079935], [null, 1510079940], [null, 1510079945], [null, 1510079950], [null, 1510079955], [null, 1510079960], [null, 1510079965], [null, 1510079970], [null, 1510079975], [null, 1510079980], [null, 1510079985], [null, 1510079990], [null, 1510079995], [null, 1510080000], [null, 1510080005], [4.0, 1510080010], [8.0, 1510080015], [6.0, 1510080020], [6.0, 1510080025], [4.0, 1510080030], [5.0, 1510080035], [4.0, 1510080040]],
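
To make the gap easier to see, the following sketch collapses the datapoints into runs of present and null values. The `summarize_runs` helper is hypothetical and written for this illustration. The values and timestamps are rebuilt from the visible part of the output above, which contains 14 values, then 20 nulls, then 7 values at a 5-second step. Against a live server one would presumably parse the render response with `json.loads` and pass in the `datapoints` list of the first series.

```python
from itertools import groupby


def summarize_runs(datapoints):
    """Collapse [value, timestamp] pairs into runs of present / null values."""
    runs = []
    for is_null, group in groupby(datapoints, key=lambda p: p[0] is None):
        group = list(group)
        runs.append({
            "null": is_null,
            "count": len(group),
            "first_ts": group[0][1],
            "last_ts": group[-1][1],
        })
    return runs


# Rebuild the visible excerpt of the render output (5-second step).
step = 5
head = [7.0, 6.0, 4.0, 5.0, 5.0, 8.0, 5.0, 6.0, 9.0, 5.0, 4.0, 3.0, 4.0, 3.0]
gap = [None] * 20
tail = [4.0, 8.0, 6.0, 6.0, 4.0, 5.0, 4.0]
datapoints = [[v, 1510079840 + i * step] for i, v in enumerate(head + gap + tail)]

for run in summarize_runs(datapoints):
    kind = "null" if run["null"] else "present"
    print(f"{kind}: {run['count']} points, {run['first_ts']}..{run['last_ts']}")
# present: 14 points, 1510079840..1510079905
# null: 20 points, 1510079910..1510080005
# present: 7 points, 1510080010..1510080040
```

The null run covers 100 seconds of data, and the final run of 7 points matches the "returned 7 values" entry in query.log.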

python ~/whisper-info.py /opt/graphite/storage/whisper/web/servers/web/HOST1/perf/processor/pct_processor_time.wsp
maxRetention: 63072000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 10172212

Archive 0
retention: 2592000
secondsPerPoint: 5
points: 518400
size: 6220800
offset: 52

Archive 1
retention: 15552000
secondsPerPoint: 60
points: 259200
size: 3110400
offset: 6220852

Archive 2
retention: 63072000
secondsPerPoint: 900
points: 70080
size: 840960
offset: 9331252
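
As a sanity check on the whisper-info output above, the retention, size, and offset values can be recomputed from secondsPerPoint and points alone. This is a sketch that assumes the standard whisper layout as I understand it: a 16-byte file header, a 12-byte header entry per archive, and 12 bytes per datapoint. The `layout` function and the constant names are made up for this example.

```python
POINT_SIZE = 12         # bytes per datapoint: 4-byte timestamp + 8-byte double
METADATA_SIZE = 16      # file-level header
ARCHIVE_INFO_SIZE = 12  # header entry per archive

# (secondsPerPoint, points) for Archive 0, 1 and 2, taken from whisper-info.
archives = [(5, 518400), (60, 259200), (900, 70080)]


def layout(archives):
    """Return per-archive retention/size/offset and the total file size."""
    offset = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    rows = []
    for seconds_per_point, points in archives:
        size = points * POINT_SIZE
        rows.append({
            "retention": seconds_per_point * points,
            "size": size,
            "offset": offset,
        })
        offset += size  # the next archive starts where this one ends
    return rows, offset


rows, file_size = layout(archives)
for i, row in enumerate(rows):
    days = row["retention"] // 86400
    print(f"Archive {i}: retention={row['retention']} ({days}d) "
          f"size={row['size']} offset={row['offset']}")
print("fileSize:", file_size)
# Archive 0: retention=2592000 (30d) size=6220800 offset=52
# Archive 1: retention=15552000 (180d) size=3110400 offset=6220852
# Archive 2: retention=63072000 (730d) size=840960 offset=9331252
# fileSize: 10172212
```

The computed values agree with the reported ones, so the file is consistent with a 5-second archive kept for 30 days, a 1-minute archive kept for 180 days, and a 15-minute archive kept for 730 days.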

carbon.conf
[cache]
DATABASE = whisper
ENABLE_LOGROTATION = True
USER =
MAX_CACHE_SIZE = 5000000
MAX_UPDATES_PER_SECOND = 750
MAX_CREATES_PER_MINUTE = 500
MIN_TIMESTAMP_RESOLUTION = 1
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2004
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2004
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2005
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CREATES = False
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
GRAPHITE_URL = http://127.0.0.1:80
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = 127.0.0.1:2005, 10.1.1.12:2005
MAX_QUEUE_SIZE = 100000
MAX_DATAPOINTS_PER_MESSAGE = 500
QUEUE_LOW_WATERMARK_PCT = 0.8
TIME_TO_DEFER_SENDING = 0.0001
USE_FLOW_CONTROL = True
USE_RATIO_RESET=False
MIN_RESET_STAT_FLOW=1000
MIN_RESET_RATIO=0.9
MIN_RESET_INTERVAL=121
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
FORWARD_ALL = True
DESTINATIONS = 127.0.0.1:2004
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

local_settings.py
SECRET_KEY = 'Edited'
CLUSTER_SERVERS = ["10.1.1.12:80"]
REMOTE_FIND_TIMEOUT = 3.0           # Timeout for metric find requests
REMOTE_FETCH_TIMEOUT = 3.0          # Timeout to fetch series data
REMOTE_RETRY_DELAY = 60.0           # Time before retrying a failed remote webapp
