
graphite-dev team mailing list archive

[Question #252730]: Retry/ Re-hash when destination carbon daemons become unavailable

 

New question #252730 on Graphite:
https://answers.launchpad.net/graphite/+question/252730

Has any thought been put into the ability to retry or rehash destinations when a backend carbon daemon goes down?

My concern is that in a cluster setup, there is potential for data loss when storage boxes (wherever your carbon daemons run) go down for any reason.

For example:

If I had a relay on one server receiving 100k metrics that were then being consistently hashed to 4 relays on other servers, it seems like there is a potential for loss. 

100k metrics -> 4 boxes @ ~25k each,
say MAX_QUEUE_SIZE is 20k.

Box 4 goes down, so the primary relay starts to queue its metrics, up to MAX_QUEUE_SIZE.

Box 3 goes down, so the primary relay starts to queue up to MAX_QUEUE_SIZE for this box too, but the queue is already full.

Then, depending on how flow control is used, metrics are potentially dropped on the floor as sockets are ignored.
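To put rough numbers on the example above, here is a minimal sketch in Python. It is not carbon code. It assumes that "25k each" means 25k metrics per flush interval, that each destination has its own queue capped at MAX_QUEUE_SIZE, and that anything that does not fit is dropped. The function name `simulate` and the constant `PER_BOX_RATE` are made up for illustration.

```python
MAX_QUEUE_SIZE = 20_000   # value from the example above
PER_BOX_RATE = 25_000     # ~25k metrics per interval per box (the interval is an assumption)


def simulate(intervals, down_boxes, boxes=4):
    """Return (queued, dropped) metric counts per box after `intervals` intervals.

    Healthy boxes are assumed to receive everything immediately, so only
    boxes listed in `down_boxes` accumulate a queue or drop metrics.
    """
    queued = {b: 0 for b in range(1, boxes + 1)}
    dropped = {b: 0 for b in range(1, boxes + 1)}
    for _ in range(intervals):
        for box in queued:
            if box not in down_boxes:
                continue  # healthy destination: nothing queues up
            free = MAX_QUEUE_SIZE - queued[box]
            accepted = min(free, PER_BOX_RATE)
            queued[box] += accepted
            dropped[box] += PER_BOX_RATE - accepted  # overflow is lost
    return queued, dropped


if __name__ == "__main__":
    queued, dropped = simulate(intervals=2, down_boxes={3, 4})
    print("queued:", queued)
    print("dropped:", dropped)
```

Under these assumptions the queue for a dead box is full before the first interval is over, and every interval after that loses the full 25k for that box. If the limit were shared between destinations, the loss would start even sooner.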

MAX_QUEUE_SIZE seems to be useful only when sending relatively small quantities of metrics, since it can fill very quickly at higher volumes.

In the above case, I would hope for the primary relay to recognize the status of the daemons on boxes 3 and 4 and rehash their metrics to 1 and 2 so that there is no data loss.
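A rough sketch of the kind of rehash I have in mind, in Python. This is not carbon's own hash ring; the class `HashRing`, its `replicas` parameter and the `unavailable` argument are all hypothetical. The idea is to walk the ring clockwise from the metric's position and skip any destination that is currently marked as down.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring that can skip unavailable destinations."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` positions on the ring to even out the spread.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 32 bits of an MD5 digest, used only for placement on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric, unavailable=frozenset()):
        """Return the first available node clockwise from the metric's position."""
        if not self._ring:
            raise RuntimeError("ring has no destinations")
        start = bisect.bisect(self._keys, self._hash(metric)) % len(self._ring)
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in unavailable:
                return node
        raise RuntimeError("no available destinations")


if __name__ == "__main__":
    ring = HashRing(["box1", "box2", "box3", "box4"])
    down = {"box3", "box4"}
    print(ring.get_node("servers.web01.cpu.user"))
    print(ring.get_node("servers.web01.cpu.user", unavailable=down))
```

With this approach, metrics that already hash to boxes 1 and 2 stay where they are, and only the metrics owned by boxes 3 and 4 move. The catch is that the rehashed data lands on a different box than the one that normally stores it, so reads would have to account for that once the dead boxes return.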

Has anyone worked out a better solution for a larger-scale cluster setup? Are there plans to add a retry, or a command for the relays to rehash based on a modified destination list?

I would imagine this would require some sort of "inaccessible destinations list" that a destination could be added to, so that unresponsive carbon daemons are filtered out. I would also think that you would need some form of check against the destinations to determine whether they are viable candidates for writes, and that check would then update the aforementioned list.
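For the destination check, something along these lines is what I am picturing. It is only a sketch and none of it exists in carbon: `tcp_probe`, `DestinationTracker` and `max_failures` are names I made up. It treats a destination as down after a few consecutive failed TCP connects and takes it back off the list as soon as a connect succeeds.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class DestinationTracker:
    """Keeps a set of destinations that failed several checks in a row."""

    def __init__(self, destinations, probe=tcp_probe, max_failures=3):
        self.destinations = list(destinations)  # (host, port) tuples
        self.probe = probe
        self.max_failures = max_failures
        self.failures = {dest: 0 for dest in self.destinations}
        self.inaccessible = set()

    def check_all(self):
        """Probe every destination once and return the current inaccessible set."""
        for dest in self.destinations:
            host, port = dest
            if self.probe(host, port):
                self.failures[dest] = 0
                self.inaccessible.discard(dest)  # recovered: eligible for writes again
            else:
                self.failures[dest] += 1
                if self.failures[dest] >= self.max_failures:
                    self.inaccessible.add(dest)
        return set(self.inaccessible)
```

A relay could call `check_all()` on a timer and pass the result to whatever picks the destination for each metric. Requiring several failures in a row is there to avoid rehashing on a single dropped connection; the right threshold would need testing.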

Just wanted to put the question out there before I try to write something.

