graphite-dev team mailing list archive

[Question #276589]: Missing metrics periodically in graphite

To: graphite-dev@xxxxxxxxxxxxxxxxxxx
From: john <question276589@xxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 03 Dec 2015 03:47:47 -0000
Reply-to: question276589@xxxxxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx
New question #276589 on Graphite:
https://answers.launchpad.net/graphite/+question/276589

I am running 3 servers behind a single load balancer.  Servers A, B, and C each run a carbon-relay with 2 carbon-caches (1 for each CPU, as I have read in other documentation).  I am seeing an issue where the same metrics periodically stop being written and are then written later.

example:
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:34 guest.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:31 idle.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:34 iowait.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:34 irq.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:31 nice.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:34 softirq.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:34 steal.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:31 system.wsp
-rw-r--r-- 1 graphite graphite 224680 Dec  3 03:34 user.wsp

You can see that the idle, nice, and system CPU metrics are all 3 minutes behind.  These metrics are delivered every 60 seconds, and my storage-schemas.conf matches that interval.
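
A minimal sketch for spotting lagging files like the ones in the listing above, without reading `ls` output by eye. It compares each .wsp file's modification time against the newest .wsp file in the same directory. The function name and the 120-second threshold are assumptions made for illustration; they are not part of Graphite.

```python
import os


def find_stale_whisper_files(directory, max_lag_seconds=120):
    """Return sorted (name, lag_seconds) pairs for .wsp files whose
    mtime trails the newest .wsp file in `directory` by more than
    `max_lag_seconds`. The threshold is an assumed value."""
    mtimes = {}
    for name in os.listdir(directory):
        if name.endswith(".wsp"):
            mtimes[name] = os.path.getmtime(os.path.join(directory, name))
    if not mtimes:
        return []
    newest = max(mtimes.values())
    stale = [
        (name, newest - mtime)
        for name, mtime in mtimes.items()
        if newest - mtime > max_lag_seconds
    ]
    return sorted(stale)
```

Pointed at the directory holding the cpu metrics, this would be expected to report idle.wsp, nice.wsp, and system.wsp with a lag of roughly 180 seconds for the listing shown.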

This happens only on server A; servers B and C both have the metrics.  I am running the same configs on all 3 boxes.  One really interesting thing I have seen is that the cache-b logs show a lot of queries, while the cache-a logs show none.  Also, cache-a never showed a queue increase, whereas cache-b showed its queue increase to 800.  I have been seeing fullQueueDrops but don't understand why.

On the disk side, I am running on SSD and seeing the following from iostat.  I can provide more info if needed.

-sh-4.2$ iostat -d 1
Linux 3.10.0-229.14.1.el7.x86_64 (ip-10-110-1-18) 	12/03/2015 	_x86_64_	(2 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda            486.49         5.27      2171.49    4232842 1744289310

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              1.00         8.00         0.00          8          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda           2150.00         0.00      8600.00          0       8600

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda           3051.00         0.00     12232.00          0      12232

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda           2934.00         0.00     12984.00          0      12984

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda           1056.00         0.00      4228.00          0       4228

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00       
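
A rough sketch for summarizing `iostat -d 1` output like the sample above, to make the write bursts easier to see. It assumes the six-column layout shown (Device, tps, kB_read/s, kB_wrtn/s, kB_read, kB_wrtn), skips any incomplete line, and drops the first report, which iostat computes as an average since boot rather than over the last second. The function names and the 1000 kB/s threshold are assumptions for illustration.

```python
def parse_iostat_writes(text, device="xvda"):
    """Return the kB_wrtn/s value from each complete report line for `device`."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # A complete device line has exactly 6 columns; truncated lines are skipped.
        if len(fields) == 6 and fields[0] == device:
            samples.append(float(fields[3]))
    return samples


def summarize_bursts(samples, threshold_kb_per_s=1000.0):
    """Summarize per-interval write rates, ignoring the since-boot first report."""
    interval = samples[1:]
    busy = [s for s in interval if s > threshold_kb_per_s]
    return {
        "intervals": len(interval),
        "busy_intervals": len(busy),
        "peak_kb_per_s": max(interval, default=0.0),
    }
```

Applied to the output above, this counts 8 complete one-second intervals, 4 of them above the assumed threshold, with a peak of 12984.0 kB/s, so the writes arrive in a burst of a few seconds followed by idle seconds.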

I am currently handling about 40k metrics every 60 seconds.  I'm really confused about why the same metrics are consistently the ones that go missing.  I thought that if this were a queue or caching issue, random metrics would be affected.  Any help and direction would be really appreciated.
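
For a sense of scale, the load figure above can be turned into a per-second rate. This is only back-of-the-envelope arithmetic: it assumes the 40k figure is what a single server receives and that the relay splits it evenly between the 2 carbon-caches, neither of which is confirmed here. The function name is hypothetical.

```python
def ingest_rates(metrics_per_interval, interval_seconds, cache_count):
    """Return (total datapoints per second, datapoints per second per cache),
    assuming an even split across caches."""
    total = metrics_per_interval / interval_seconds
    return total, total / cache_count


total, per_cache = ingest_rates(40_000, 60, 2)
print(f"{total:.1f} datapoints/s total, {per_cache:.1f} per cache")
# prints "666.7 datapoints/s total, 333.3 per cache"
```
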
Thanks.

-- 
You received this question notification because your team graphite-dev
is an answer contact for Graphite.