graphite-dev team mailing list archive

[Question #240699]: Webapp not querying the correct carbon-cache instance even after specifying all the instances in CARBONLINK_HOSTS (taking care of the order in which they are specified)

 

New question #240699 on Graphite:
https://answers.launchpad.net/graphite/+question/240699

In our Graphite (version 0.9.9) setup there are two carbon-relays and a total of 11 carbon-cache instances: three carbon-cache instances sit behind one of the relays and two behind the other. All the relays and caches run on one single box. This seemingly weird setup came into existence over an extended period of time: initially, one carbon-cache instance was dedicated to each team to send its metrics to. Then a few teams started sending a lot of metrics, so the number of carbon-cache instances was increased for them, putting those instances behind a relay and using consistent hashing to distribute metrics across them.
Due to the increased number of metrics, our Graphite box became limited by disk I/O. Some googling suggested we would have to go for Flashcache. In the course of that googling we also came across
https://answers.launchpad.net/graphite/+question/178969 
The thing that caught our attention there was batch-updating of whisper files. We felt this would improve disk performance, and since the box had no dearth of memory, we could afford to keep metrics in memory.
So we modified writer.py to make carbon-cache call whisper.update_many() only for those metrics that have accumulated more than a certain configurable threshold number of datapoints in the cache (the threshold is set by adding a line in conf.py). The threshold is suspended when the cache size grows above 80% of MAX_CACHE_SIZE, and reinstated when the cache size comes back down. The result has been quite encouraging: before the change, disk utilization very frequently reached 100% and persisted there for some time. After the change, average disk utilization is almost halved; it still reaches 100%, but much less frequently and for much shorter periods.
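To make the change concrete, here is a minimal sketch of the flush-selection logic described above. The names (cache as a dict of metric to datapoints, max_cache_size, write_threshold) are illustrative stand-ins, not carbon's actual internal API:

```python
def metrics_to_flush(cache, max_cache_size, write_threshold):
    """Pick which metrics to write out via whisper.update_many().

    Only metrics holding at least `write_threshold` datapoints are
    flushed; if the cache is over 80% of `max_cache_size`, the
    threshold is suspended and every metric becomes eligible.
    """
    cache_size = sum(len(points) for points in cache.values())
    threshold_active = cache_size <= 0.8 * max_cache_size
    for metric, datapoints in cache.items():
        if not threshold_active or len(datapoints) >= write_threshold:
            yield metric, datapoints
```

The point of the threshold is to amortize each whisper file open over many datapoints; the 80% escape hatch keeps the cache from overrunning MAX_CACHE_SIZE while metrics accumulate.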
But now there are a lot of metrics sitting in the caches. We have listed all the carbon-cache instances in CARBONLINK_HOSTS, yet we get the most recent datapoints for very few metrics. For most metrics, the graph shows only as many datapoints as have already been written to the whisper files. We figured out the reason for this as well:
the graphite webapp builds a hash ring from the instances specified in CARBONLINK_HOSTS. For each metric in a query, a hash of the metric name is computed, which determines which carbon-cache instance is queried for that metric. In our setup, a few carbon-cache instances are behind one relay, a few are behind the other relay, and the rest are not behind any relay. So an incoming metric reaches a carbon-cache either directly or via a hash ring formed over only a subset of the carbon-cache instances, whereas at query time the webapp uses a hash ring made up of all the carbon-cache instances. The two rings can therefore map the same metric to different instances.
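The mismatch is easy to see with a simplified ring. This is only a sketch of the idea, not carbon's real implementation (which lives in carbon.hashing.ConsistentHashRing and keys nodes by server/instance pairs); the replica count and key format here are illustrative. A ring built over all instances and a ring built over a relay's subset will, in general, place the same metric on different instances:

```python
import hashlib
from bisect import bisect_left

def _hash(key):
    # Like Graphite, reduce md5 to a short integer position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest()[:4], 16)

class Ring:
    def __init__(self, nodes, replicas=100):
        # Each node appears `replicas` times on the ring for smoother balance.
        self.ring = sorted(
            (_hash('%s:%d' % (node, i)), node)
            for node in nodes for i in range(replicas)
        )
        self._positions = [pos for pos, _ in self.ring]

    def get_node(self, metric):
        # First ring entry at or after the metric's hash, wrapping around.
        i = bisect_left(self._positions, _hash(metric)) % len(self.ring)
        return self.ring[i][1]
```

Because membership changes the ring layout, Ring(all_instances).get_node(m) need not equal Ring(subset_behind_relay).get_node(m), which is exactly why the webapp asks the wrong cache for metrics that were routed through a relay.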
For the graphs to be useful to the respective teams, it is important that they show the most recent datapoints. So for now, we have modified the webapp to query every carbon-cache instance in CARBONLINK_HOSTS for each metric (in the same order as given in CARBONLINK_HOSTS). But this increases the query time for each metric, so we need a better way to do it.
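For reference, the workaround amounts to something like the following. fetch_cached is a hypothetical stand-in for the webapp's CarbonLink cache query; the merge rule shown (later hosts win on timestamp collisions) is one reasonable choice, reflecting why the order in CARBONLINK_HOSTS matters:

```python
def query_all_instances(metric, hosts, fetch_cached):
    """Query every carbon-cache instance in order and merge the replies.

    `fetch_cached(host, metric)` is assumed to return an iterable of
    (timestamp, value) pairs from that instance's in-memory cache.
    """
    merged = {}
    for host in hosts:
        for timestamp, value in fetch_cached(host, metric):
            merged[timestamp] = value  # later hosts override earlier ones
    return sorted(merged.items())
```

The cost is one round trip per instance per metric instead of one per metric, which is where the extra query time comes from.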
Secondly, given the way the webapp determines which carbon-cache instance to query, how would the correct instance be queried if relay-rules were used instead of consistent hashing?
  


-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.