
graphite-dev team mailing list archive

[Question #218542]: Data gaps in a 2x graphite-relay + 4x carbon-cache setup

 

New question #218542 on Graphite:
https://answers.launchpad.net/graphite/+question/218542

Hi there,

I've inherited an existing Graphite installation and am having trouble making it work reliably: data is missing from our graphs and I haven't been able to pinpoint the cause so far.

The setup lives on a single host: 2x graphite-relay instances in consistent-hash mode, both using the same 4 carbon-cache instances as backends. These in turn all write to the same Whisper database, which resides on a 4x RAID0 array of local (ephemeral) storage on an Amazon instance.
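
To make the consistent-hash part of the setup concrete, here is a minimal sketch of the general idea. It is an illustration only and not carbon's actual implementation: the `HashRing` class, the replica count and the backend names are all hypothetical. The point it shows is that two relays configured with the same backend list should pick the same carbon-cache for a given metric name.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative, not carbon's own code)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring several times to even out the load.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._position(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _position(key):
        # Map a string to an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric):
        # Walk clockwise to the first entry at or after the metric's position,
        # wrapping around to the start of the ring if necessary.
        pos = self._position(metric)
        idx = bisect.bisect_left(self._ring, (pos, ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical names standing in for the 4 carbon-cache backends.
CACHES = ["cache-a", "cache-b", "cache-c", "cache-d"]
relay_1 = HashRing(CACHES)
relay_2 = HashRing(CACHES)
```

With identical backend lists, `relay_1.get_node(name)` and `relay_2.get_node(name)` return the same backend for any metric name, so in this model each Whisper file has a single writer.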

A collectd daemon on each of our hosts pushes system metrics to the 1st graphite-relay instance.
A custom central script running on the Graphite box pulls application metrics from our hosts and sends them every minute to the 2nd graphite-relay instance.

First question: does anything in this setup raise your eyebrows so far?

avgUpdateTime in the carbon stats is around 0.0005 for each carbon-cache instance, and MAX_UPDATES_PER_SECOND is 300.
graphite-relay doesn't seem to be making use of its overflow log, and the load on the box isn't too crazy.

Data gaps seem to happen exclusively on the app metrics sent to the 2nd relay.
Internal Carbon stats are 100% fine, for example.

I've tried to follow what happens to one particular app metric.
I've verified that our custom script successfully gets data from the hosts every minute, and that it at least manages to send the correct number of bytes to graphite-relay over the socket (Python's socket.send always reports the same number of bytes sent as the message's original size). I hope that's proof enough that graphite-relay has received the data.
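
For reference, the sending step described above can be sketched roughly as follows. This is not our actual script: the function names `format_line` and `send_metric` are hypothetical, and the host and port are whatever the relay's line receiver listens on. It assumes carbon's plaintext protocol of one `path value timestamp` line per data point.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, Unix timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n".encode("ascii")


def send_metric(host, port, path, value, timestamp=None):
    """Send one data point over TCP and return the payload size in bytes."""
    payload = format_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        # sendall() keeps calling send() until every byte has been handed to
        # the kernel, or raises on error. Like send(), it only confirms that
        # the bytes were queued locally, not that the receiver processed them.
        sock.sendall(payload)
    return len(payload)
```

`sendall` is used here instead of a bare `send` so that a partial write can't go unnoticed; the returned length is the size of the message, comparable to the "sent 61 of 61 bytes" figures in the logs.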

Here's what half an hour of logs from the sending script looks like for this particular metric:
2013/01/07/143001.log:graphite-DEBUG-Path: <metric name>, value:1156601, sent 61 of 61 bytes
2013/01/07/143101.log:graphite-DEBUG-Path: <metric name>, value:1156628, sent 61 of 61 bytes
2013/01/07/143201.log:graphite-DEBUG-Path: <metric name>, value:1156649, sent 61 of 61 bytes
2013/01/07/143301.log:graphite-DEBUG-Path: <metric name>, value:1156690, sent 61 of 61 bytes
2013/01/07/143401.log:graphite-DEBUG-Path: <metric name>, value:1156726, sent 61 of 61 bytes
2013/01/07/143501.log:graphite-DEBUG-Path: <metric name>, value:1156762, sent 61 of 61 bytes
2013/01/07/143601.log:graphite-DEBUG-Path: <metric name>, value:1156789, sent 61 of 61 bytes
2013/01/07/143701.log:graphite-DEBUG-Path: <metric name>, value:1156826, sent 61 of 61 bytes
2013/01/07/143801.log:graphite-DEBUG-Path: <metric name>, value:1156868, sent 61 of 61 bytes
2013/01/07/143901.log:graphite-DEBUG-Path: <metric name>, value:1156907, sent 61 of 61 bytes
2013/01/07/144001.log:graphite-DEBUG-Path: <metric name>, value:1156945, sent 61 of 61 bytes
2013/01/07/144101.log:graphite-DEBUG-Path: <metric name>, value:1156972, sent 61 of 61 bytes
2013/01/07/144201.log:graphite-DEBUG-Path: <metric name>, value:1157003, sent 61 of 61 bytes
2013/01/07/144302.log:graphite-DEBUG-Path: <metric name>, value:1157039, sent 61 of 61 bytes
2013/01/07/144401.log:graphite-DEBUG-Path: <metric name>, value:1157101, sent 61 of 61 bytes
2013/01/07/144501.log:graphite-DEBUG-Path: <metric name>, value:1157130, sent 61 of 61 bytes
2013/01/07/144601.log:graphite-DEBUG-Path: <metric name>, value:1157163, sent 61 of 61 bytes
2013/01/07/144701.log:graphite-DEBUG-Path: <metric name>, value:1157184, sent 61 of 61 bytes
2013/01/07/144801.log:graphite-DEBUG-Path: <metric name>, value:1157218, sent 61 of 61 bytes
2013/01/07/144901.log:graphite-DEBUG-Path: <metric name>, value:1157253, sent 61 of 61 bytes
2013/01/07/145001.log:graphite-DEBUG-Path: <metric name>, value:1157289, sent 61 of 61 bytes
2013/01/07/145101.log:graphite-DEBUG-Path: <metric name>, value:1157338, sent 61 of 61 bytes
2013/01/07/145201.log:graphite-DEBUG-Path: <metric name>, value:1157402, sent 61 of 61 bytes
2013/01/07/145301.log:graphite-DEBUG-Path: <metric name>, value:1157447, sent 61 of 61 bytes
2013/01/07/145401.log:graphite-DEBUG-Path: <metric name>, value:1157483, sent 61 of 61 bytes
2013/01/07/145501.log:graphite-DEBUG-Path: <metric name>, value:1157535, sent 61 of 61 bytes
2013/01/07/145601.log:graphite-DEBUG-Path: <metric name>, value:1157573, sent 61 of 61 bytes
2013/01/07/145701.log:graphite-DEBUG-Path: <metric name>, value:1157608, sent 61 of 61 bytes
2013/01/07/145801.log:graphite-DEBUG-Path: <metric name>, value:1157648, sent 61 of 61 bytes
2013/01/07/145901.log:graphite-DEBUG-Path: <metric name>, value:1157695, sent 61 of 61 bytes

And here's what ends up in Whisper (output of whisper-fetch, with the Unix timestamps converted to dates):
Mon Jan 7 14:30:00 GMT 2013 1156601.000000
Mon Jan 7 14:31:00 GMT 2013 1156628.000000
Mon Jan 7 14:32:00 GMT 2013 None
Mon Jan 7 14:33:00 GMT 2013 None
Mon Jan 7 14:34:00 GMT 2013 1156726.000000
Mon Jan 7 14:35:00 GMT 2013 None
Mon Jan 7 14:36:00 GMT 2013 None
Mon Jan 7 14:37:00 GMT 2013 1156826.000000
Mon Jan 7 14:38:00 GMT 2013 1156868.000000
Mon Jan 7 14:39:00 GMT 2013 None
Mon Jan 7 14:40:00 GMT 2013 None
Mon Jan 7 14:41:00 GMT 2013 1156972.000000
Mon Jan 7 14:42:00 GMT 2013 1157003.000000
Mon Jan 7 14:43:00 GMT 2013 1157039.000000
Mon Jan 7 14:44:00 GMT 2013 1157101.000000
Mon Jan 7 14:45:00 GMT 2013 1157130.000000
Mon Jan 7 14:46:00 GMT 2013 None
Mon Jan 7 14:47:00 GMT 2013 None
Mon Jan 7 14:48:00 GMT 2013 1157218.000000
Mon Jan 7 14:49:00 GMT 2013 1157253.000000
Mon Jan 7 14:50:00 GMT 2013 None
Mon Jan 7 14:51:00 GMT 2013 None
Mon Jan 7 14:52:00 GMT 2013 1157402.000000
Mon Jan 7 14:53:00 GMT 2013 1157447.000000
Mon Jan 7 14:54:00 GMT 2013 None
Mon Jan 7 14:55:00 GMT 2013 None
Mon Jan 7 14:56:00 GMT 2013 None
Mon Jan 7 14:57:00 GMT 2013 None
Mon Jan 7 14:58:00 GMT 2013 None
Mon Jan 7 14:59:00 GMT 2013 None
Mon Jan 7 15:00:00 GMT 2013 1157728.000000
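
To compare the two listings above mechanically rather than by eye, something like the following sketch could be used. It assumes exactly the two line formats shown (the sender's log lines and the date-converted whisper-fetch output) and an English/C locale for the month names; the function names are hypothetical.

```python
import re
from datetime import datetime

# e.g. "2013/01/07/143001.log:graphite-DEBUG-Path: <metric name>, value:1156601, sent ..."
SENT_RE = re.compile(r"^(\d{4}/\d{2}/\d{2}/\d{6})\.log:.*value:(\d+),")
# e.g. "Mon Jan 7 14:32:00 GMT 2013 None"
STORED_RE = re.compile(r"^\w{3} (\w{3} +\d+ \d{2}:\d{2}:\d{2}) GMT (\d{4}) (\S+)$")


def parse_sent(lines):
    """Map each send time, floored to the minute, to the value that was sent."""
    sent = {}
    for line in lines:
        m = SENT_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d/%H%M%S")
            sent[ts.replace(second=0)] = int(m.group(2))
    return sent


def parse_stored(lines):
    """Map each Whisper slot to its stored value, or None for an empty slot."""
    stored = {}
    for line in lines:
        m = STORED_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{m.group(1)} {m.group(2)}", "%b %d %H:%M:%S %Y")
            stored[ts] = None if m.group(3) == "None" else float(m.group(3))
    return stored


def missing_minutes(sent, stored):
    """Return the minutes for which a value was sent but Whisper holds nothing."""
    return sorted(ts for ts in sent if stored.get(ts) is None)
```

Feeding the sender's log lines to `parse_sent` and the whisper-fetch lines to `parse_stored`, then calling `missing_minutes` on the results, lists every minute that was sent but never stored. For the half hour shown here that is 16 of the 30 data points.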

So data is getting lost somewhere between our script, graphite-relay, carbon-cache and Whisper. How should I go about debugging this further?

FWIW, I've tried shutting down the 1st graphite-relay in case having 2 relays writing to the carbon-cache backends at the same time is an issue. This didn't help.

Any help is much appreciated!
