
graphite-dev team mailing list archive

Re: Cassandra backend

 

Kraig,

That is really awesome. We've been evaluating Graphite at my current job.
We've been looking at possibly replacing the carbon backend with a different
backend that is non-hierarchical (i.e., not limited to what a directory
structure can support). We'd like each data file to be tagged with multiple
pieces of metadata that can then be combined in more complex ways than what
is currently possible with Graphite's tree browser. For example, we might
want to know that metric "foo.bar" came from machine X and application Y,
plus several other arbitrary pieces of data. All of the metadata is optional
and therefore not very convenient to express in a rigid tree structure. The
main purpose of the metadata/tags is to filter different series to view
together. If Cassandra provides good performance and frees us structurally,
it might be a good candidate to evaluate.
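
To make this concrete, here is a rough sketch of the kind of tag-based
lookup we have in mind (the metric names and the in-memory store are made
up, nothing Graphite-specific):

    import fnmatch

    # Hypothetical metadata store: each metric maps to an arbitrary tag dict.
    METRIC_TAGS = {
        'foo.bar':  {'host': 'machineX', 'app': 'appY', 'dc': 'lax'},
        'foo.baz':  {'host': 'machineZ', 'app': 'appY'},   # tags are optional
        'cpu.load': {'host': 'machineX'},
    }

    def find_metrics(**wanted):
        """Return metric names whose tags match every key=pattern given."""
        return [metric for metric, tags in METRIC_TAGS.items()
                if all(fnmatch.fnmatch(str(tags.get(key, '')), pattern)
                       for key, pattern in wanted.items())]

    # All series from application Y on machine X, regardless of where they
    # would live in a directory tree:
    print(find_metrics(app='appY', host='machineX'))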

I'm not sure if others would be interested in something like this as well.

Matt

On Thu, Mar 18, 2010 at 1:01 AM, Chris Davis <chrismd@xxxxxxxxx> wrote:

> Wow, thanks for the performance numbers Kraig. Yes I am definitely
> interested in learning more about Cassandra. I have not had a chance to try
> it out yet myself but it sounds pretty cool and it is nice to see someone
> has it running in a working system. I have also been hearing a lot of buzz
> about Redis lately (http://code.google.com/p/redis/) and some users are
> looking into using that as a carbon-cache replacement as well. Again I
> haven't had a chance to mess around with it myself so I don't know all that
> much about it. I am cc'ing the mailing list so we can open this up to a
> broader discussion, possibly about Redis as well.
>
> I am very curious about how Cassandra actually works and I am also curious
> about the %iowait numbers from your testing. In particular the %iowait times
> are nearly zero on the Cassandra servers. Presumably Cassandra is getting
> everything persisted to disk with consistent performance over time? If it is,
> then I think it would be a good idea to test a scenario where a lot of
> metrics are being written and a lot are being retrieved at the same time.
> The reason this is important is that there are two I/O access patterns
> heavily used by Graphite that are in complete conflict with one another.
>
> On the one hand you have carbon persisting datapoints that are identical
> or very close to one another in terms of time but all very
> different from one another in terms of metric name (thus it would be optimal
> to organize your datapoints by time, ie. just write datapoints one after the
> other as you receive them). But on the other hand you have the Graphite
> webapp fetching many temporally contiguous datapoints at a time for
> relatively few different metrics (thus it would be optimal to organize your
> datapoints by metric name, so when you fetch 500 datapoints for one metric
> they are all in the same place). With the current whisper solution we have a
> wsp file for each metric and thus we are effectively taking the latter
> approach and indexing our datapoints by metric name. This means fetching
> datapoints is very efficient because you can fetch contiguous blocks of
> datapoints with very few I/O operations. But this also means that our other
> use case, writing datapoints, is far less efficient because we have to do
> many many small writes to many many different files, causing lots of seeking
> and hence I/O wait time. Fortunately we have a nice way to side-step this
> problem by caching and organizing incoming datapoints in such a way that we
> can coalesce more datapoints into the same number of writes, the trade off
> being that we have to cache a decent amount of data before this efficiency
> booster kicks in, which means there is a delay of possibly several minutes
> before data actually gets written to disk. This would be bad if our ultimate
> goal was getting datapoints written to disk quickly, but fortunately it
> isn't; our goal is simply realtime graphs. So we have the webapp pull data
> directly from carbon's cache and voila we have real-time graphs, very
> efficient reads, and very inefficient but essentially fixed write
> performance (in terms of I/O resources).
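>
> To illustrate (a heavily simplified sketch, not carbon's actual code): the
> cache is essentially a dict of metric name to buffered datapoints, and the
> writer drains the largest buffers first so each file is opened once per
> batch of datapoints rather than once per datapoint:
>
>   from collections import defaultdict
>
>   cache = defaultdict(list)    # metric name -> [(timestamp, value), ...]
>
>   def receive(metric, timestamp, value):
>       cache[metric].append((timestamp, value))
>
>   def write_pass(update_many):
>       # Drain the fattest buffers first: one file open and seek per metric,
>       # many datapoints written, instead of one tiny write per datapoint.
>       for metric in sorted(cache, key=lambda m: len(cache[m]), reverse=True):
>           datapoints, cache[metric] = cache[metric], []
>           if datapoints:
>               update_many(metric, datapoints)   # e.g. one whisper update_many()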
>
> *Phew*... so. The reason I mention all that is because that is my
> understanding of Graphite's current performance characteristics. Before I'd
> want to move to something else like Cassandra I would like to gain a
> sufficiently detailed understanding of how it actually works as well. Not
> that we need to become low-level Cassandra experts or anything, but I'd like
> to be able to answer some high-level questions like what use cases does
> Cassandra optimize for in terms of how it indexes datapoints? Can it index
> by metric name? By time? Both? Neither?
>
> If Cassandra's I/O wait time is consistently low then I would be willing to
> bet that the data is primarily organized by time (ie. just write it as you
> receive it). This is a great way of avoiding seeking for writes and thus you
> get low I/O wait time (for writes), but it has the cost of being much less
> efficient at reads (ie. you'd have to seek all over to find datapoints for
> the past 24 hours for a single metric). This would essentially be the
> opposite strategy currently taken with whisper. Who knows, maybe it is
> better. It may just depend on how intensive your read/write loads are
> respectively. Maybe it has some fancy caching ninjitsu. In any case I think
> the goal at this point would be to better understand how the system performs
> under various circumstances and why. Hopefully there are some general
> Cassandra benchmarks out there we can find to help with this.
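>
> To spell out the two layouts I mean (purely illustrative data structures,
> not anyone's actual storage format):
>
>   # Layout A: organized by time, essentially an append-only log. Writes
>   # never seek, but reading 24 hours of one metric means scanning everything.
>   by_time = [
>       (1268900000, 'cpu.user', 4.7),
>       (1268900000, 'mem.free', 1200.0),
>       (1268900060, 'cpu.user', 4.9),
>   ]
>
>   # Layout B: organized by metric name (whisper's approach). Reading one
>   # metric is a contiguous block, but writing touches one block per metric.
>   by_metric = {
>       'cpu.user': [(1268900000, 4.7), (1268900060, 4.9)],
>       'mem.free': [(1268900000, 1200.0)],
>   }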
>
> I think that in general many of the cool new
> ultra-distributed-cachtastic-uberscaling-nosql databases like Cassandra,
> Redis, CouchDB, HBase, TokyoCabinet, MongoDB, etc... have some fantastic
> ideas behind them, but I would like to avoid the trap that I have seen many
> of my friends fall into over the years of trusting in the magical
> performance abilities of an underlying product lower in the stack. Because
> Graphite aims for high performance and scalability, the low-level details of
> storage will always matter to some degree (unless you throw cost out the
> window, in which case just buy enough hardware and I'm sure any solution
> could scale). That said, I am all in favor of finding external products to
> replace components of Graphite with. Less code for me to maintain :)
>
> Regarding your notes on the hierarchical data issue, that is a hurdle with
> pretty much every approach I've seen that doesn't involve the filesystem,
> but I think maintaining a separate tree structure specifically for doing
> lookups/browsing (like just keeping empty files on the filesystem or
> whatever) would be a fine way to handle it; keeping it in sync shouldn't be
> too hard of a problem to tackle.
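>
> A sketch of what I mean, assuming the placeholder tree mirrors whisper's
> usual layout under the storage directory (the path is an assumption):
>
>   import os
>
>   STORAGE_DIR = '/opt/graphite/storage/whisper'   # assumed default location
>
>   def touch_index_entry(metric):
>       # 'foo.bar.baz' -> <STORAGE_DIR>/foo/bar/baz.wsp, left empty; the
>       # actual datapoints would live in Cassandra.
>       path = os.path.join(STORAGE_DIR, *metric.split('.')) + '.wsp'
>       if not os.path.isdir(os.path.dirname(path)):
>           os.makedirs(os.path.dirname(path))
>       open(path, 'a').close()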
>
> Regarding your notes on the ease of scaling by adding Cassandra nodes /
> Rack awareness... that sounds wonderful. Currently it can be a big pain in
> the ass to add a server to an existing Graphite cluster if your data is
> sharded. IMHO this would be a top reason for using an alternative storage
> system.
>
> Regarding the problem you ran into with 100k datapoints per minute, this
> isn't a carbon/twisted problem because I currently have two production
> systems each sustaining 300k datapoints per minute on comparable hardware.
> One big optimization though is to send carbon lists of metrics via its
> pickle listener instead of using the plain text method.
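>
> For reference, batching through the pickle listener looks roughly like this
> (port 2004 assumed, which is the stock default):
>
>   import pickle, socket, struct, time
>
>   now = int(time.time())
>   # A batch is a list of (metricPath, (timestamp, value)) tuples.
>   batch = [('test.foo', (now, 1.0)),
>            ('test.bar', (now, 2.0))]
>
>   payload = pickle.dumps(batch, protocol=2)
>   header = struct.pack('!L', len(payload))    # 4-byte length prefix
>
>   sock = socket.socket()
>   sock.connect(('localhost', 2004))           # carbon's pickle listener
>   sock.sendall(header + payload)
>   sock.close()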
>
> *Phew* again... I just realized this is why I don't blog, I get way too
> carried away when I write :)
>
> Anyways, I look forward to hearing your thoughts.
>
> -Chris
>
>
> On Wed, Mar 17, 2010 at 6:39 PM, Kraig Amador <kamador@xxxxxxxxxxxxx> wrote:
>
>> Hi Chris.
>>
>> I've been POC'ing replacing the whisper backend with a Cassandra-based
>> backend. I'm currently dumping keys straight into Cassandra without any kind
>> of archiving schemas applied. I whipped up a .wsp dumper to preload my
>> Cassandra servers.
>>
>> It is FAST.
>>
>> Whisper backend on a DELL 2950, 4 SAS drives in RAID 5. committedPoints
>> hovering around 40K
>> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx (3.80)# iostat
>> Linux 2.6.18-128.1.6.el5 (graphite001.sl1.shopzilla.hou)        03/17/10
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            4.76    0.01    1.75   13.70    0.00   79.78
>>
>> Cassandra backend on 2 SGI rackable nodes, Blade servers, 2 SATA drives in
>> RAID 1. Carbon and graphite are running on another one of these nodes.
>> committedPoints hovering around 76K
>>
>> Graphite node:
>> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (0.46)# iostat
>> Linux 2.6.27-briullov.1 (cloudvzdev001.sl2.shopzilla.sea)       03/17/10
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0.07    0.00    0.04    0.01    0.00   99.88
>>
>> Cassandra node 1:
>> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (0.56)# iostat
>> Linux 2.6.27-briullov.1 (cloudvzdev005.sl2.shopzilla.sea)       03/17/10
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0.07    0.00    0.05    0.01    0.00   99.87
>>
>> Cassandra node 2:
>> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (0.12)# iostat
>> Linux 2.6.27-briullov.1 (cloudvzdev003.sl1.shopzilla.sea)       03/17/10
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0.08    0.00    0.04    0.01    0.00   99.87
>>
>>
>> The second Cassandra node was fired up just to test its clustering and
>> replication. It made no performance difference.
>>
>> I've tried increasing my loader speed but when I hit around 100K carbon
>> tends to blow up. I think this is Twisted, as the Cassandra nodes don't seem
>> to have any problems with this kind of load.
>>
>>
>>
>> So, the issues with Cassandra.
>>
>>
>>    - It does not do hierarchical data very well. I really only replaced
>>    the fetch and update calls in whisper.py. I'm using the file structure to
>>    browse in graphite and my key names in Cassandra are the actual file paths.
>>    I'm undecided on the best way to solve this one. It seems like a dirty trick
>>    to try and force the tree data into Cassandra.
>>    - There will need to be a cleanup process that runs and clears out old
>>    cache values and consolidates old values depending on the storage schemas.
>>
>>
>> Benefits?
>>
>>    - It's fast
>>    - Data fetching could be optimized. For example, if the image is 900
>>    pixels wide but you are fetching over a time span that has 5000 data
>>    points, fetch a subset of the data (a rough sketch follows after this
>>    list). The only issue here is some loss of accuracy.
>>    - If you need more space/performance, you can just fire up another
>>    Cassandra node and linearly scale your IO.
>>    - Data replication that is datacenter and rack aware. We have a colo
>>    in London that I currently hit to show our London metrics. This could be
>>    replicated to our LA office without having to write any code.
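>>
>> A rough sketch of the downsampling idea mentioned above (bucket averaging
>> is just one option; picking every Nth point would also work):
>>
>>    import math
>>
>>    def downsample(datapoints, max_points):
>>        """Average fixed-size buckets so len(result) <= max_points."""
>>        if len(datapoints) <= max_points:
>>            return datapoints
>>        bucket_size = int(math.ceil(len(datapoints) / float(max_points)))
>>        result = []
>>        for i in range(0, len(datapoints), bucket_size):
>>            bucket = datapoints[i:i + bucket_size]
>>            timestamp = bucket[0][0]
>>            value = sum(v for t, v in bucket) / float(len(bucket))
>>            result.append((timestamp, value))
>>        return result
>>
>>    # e.g. 5000 raw points reduced to at most 900 for a 900px-wide graph:
>>    # points = downsample(points, 900)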
>>
>>
>>
>> So, any interest?
>>
>>
>>
>> Attached files:
>> writer.py: removed metric sorting and context.pickle; touches the .wsp file
>> so the webapp knows data exists.
>> whisper.py: replaced with Cassandra functions
>>
>> Cassandra storage conf:
>>
>>   <Keyspaces>
>>     <Keyspace Name="Graphite">
>>       <ColumnFamily CompareWith="BytesType" Name="DataPoints"/>
>>       <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
>>       <ReplicationFactor>1</ReplicationFactor>
>>       <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>>     </Keyspace>
>>   </Keyspaces>
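>>
>> For illustration only, the fetch/update calls against that column family
>> could look something like this with pycassa (the attached code may use a
>> different client, so treat this as a sketch):
>>
>>    import struct
>>    import pycassa
>>
>>    pool = pycassa.ConnectionPool('Graphite', ['cassandra01:9160'])
>>    datapoints = pycassa.ColumnFamily(pool, 'DataPoints')
>>
>>    def update(metric, timestamp, value):
>>        # Row key = metric path; column name = big-endian timestamp so the
>>        # BytesType comparator keeps columns in time order.
>>        datapoints.insert(metric,
>>                          {struct.pack('!L', timestamp): struct.pack('!d', value)})
>>
>>    def fetch(metric, from_time, until_time):
>>        cols = datapoints.get(metric,
>>                              column_start=struct.pack('!L', from_time),
>>                              column_finish=struct.pack('!L', until_time),
>>                              column_count=1000000)
>>        return [(struct.unpack('!L', t)[0], struct.unpack('!d', v)[0])
>>                for t, v in cols.items()]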
>>
>>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~graphite-dev
> Post to     : graphite-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~graphite-dev
> More help   : https://help.launchpad.net/ListHelp
>
>
