graphite-dev team mailing list archive
-
graphite-dev team
-
Mailing list archive
-
Message #00326
Re: Cassandra backend
Wow, thanks for the performance numbers Kraig. Yes I am definitely
interested in learning more about Cassandra. I have not had a chance to try
it out yet myself but it sounds pretty cool and it is nice to see someone
has it running in a working system. I have also been hearing a lot of buzz
about Redis lately (http://code.google.com/p/redis/) and some users are
looking into using that as a carbon-cache replacement as well. Again I
haven't had a chance to mess around with it myself so I don't know all that
much about it. I am cc'ing the mailing list so we can open this up to a
broader discussion, possibly about Redis as well.
I am very curious about how Cassandra actually works and I am also curious
about the %iowait numbers from your testing. In particular the %iowait times
are nearly zero on the Cassandra servers. Presumably Cassandra is getting
everything persisted to disk with consistent performance over time? If it is
then I think it would be a good idea to test a scenario where there is both
a lot of metrics being written as well as a lot of metrics being
retrieved. The reason this is important is because there are two I/O access
patterns heavily used by Graphite that are in complete conflict with one
another.
On the one hand you have carbon persisting datapoints that are the identical
or very close to one another in terms of time but all very different from
one another in terms of metric name (thus it would be optimal to organize
your datapoints by time, ie. just write datapoints one after the other as
you receive them). But on the other hand you have the Graphite webapp
fetching many temporally contiguous datapoints at a time for relatively few
different metrics (thus it would be optimal to organize your datapoints by
metric name, so when you fetch 500 datapoints for one metric they are all in
the same place). With the current whisper solution we have a wsp file for
each metric and thus we are effectively taking the latter approach and
indexing our datapoints by metric name. This means fetching datapoints is
very efficient because you can fetch contiguous blocks of datapoints with
very few I/O operations. But this also means that our other use case,
writing datapoints, is far less efficient because we have to do many many
small writes to many many different files, causing lots of seeking and hence
I/O wait time. Fortunately we have a nice way to side-step this problem by
caching and organizing incoming datapoints in such a way that we can
coalesce more datapoints into the same number of writes, the trade off being
that we have to cache a decent amount of data before this efficiency booster
kicks in, which means there is a delay of possibly several minutes before
data actually gets written to disk. This would be bad if our ultimate goal
was getting datapoints written to disk quickly but fortunately it isn't, our
goal is simply realtime graphs. So we have the webapp pull data directly
from carbon's cache and voila we have real-time graphs, very efficient
reads, and very inefficient but essentially fixed write performance (in
terms of I/O resources).
*Phew*... so. The reason I mention all that is because that is my
understanding of Graphite's current performance characteristics. Before I'd
want to move to something else like Cassandra I would like to gain a
sufficiently detailed understanding of how it actually works as well. Not
that we need to become low-level Cassandra experts or anything, but I'd like
to be able to answer some high-level questions like what use cases does
Cassandra optimize for in terms of how it indexes datapoints? Can it index
by metric name? By time? Both? Neither?
If Cassandra's I/O wait time is consistently low then I would be willing to
bet that the data is primarily organized by time (ie. just write it as you
receive it). This is a great way of avoiding seeking for writes and thus you
get low I/O wait time (for writes), but it has the cost of being much less
efficient at reads (ie. you'd have to seek all over to find datapoints for
the past 24 hours for a single metric). This would essentially be the
opposite strategy currently taken with whisper. Who knows, maybe it is
better. It may just depend on how intensive your read/write loads are
respectively. Maybe it has some fancy caching ninjitsu In any case I think
the goal at this point would be to better understand how the system performs
under various circumstances and why. Hopefully there are some general
Cassandra benchmarks out there we can find to help with this.
I think that in general many of the cool new
ultra-distributed-cachtastic-uberscaling-nosql databases like Cassandra,
Redis, CouchDB, HBase, TokyoCabinet, MongoDB, etc... have some fantastic
ideas behind them, but I would like to avoid the trap that I have seen many
of my friends fall into over the years of trusting in the magical
performance abilities of an underlying product lower in the stack. Because
Graphite aims for high performance and scalability, the low-level details of
storage will always matter to some degree (unless you throw cost out the
window, in which case just buy enough hardware and I'm sure any solution
could scale). That said, I am all in favor of finding external products to
replace components of Graphite with. Less code for me to maintain :)
Regarding your notes on the hierarchical data issue, that is a hurdle with
pretty much every approach I've seen that doesn't involve the filesystem,
but I think maintaining a separate tree structure specifically for doing
lookups/browsing (like just keeping empty files on the filesystem or
whatever) would be a fine way to handle it, keeping it in sync shouldn't be
too hard of a problem to tackle.
Regarding your notes on the ease of scaling by adding Cassandra nodes / Rack
awareness... that sounds wonderful. Currently it can be a big pain in the
ass to add a server to an existing Graphite cluster if your data is sharded.
IMHO this would be a top reason for using an alternative storage system.
Regarding the problem you ran into with 100k datapoints per minute, this
isn't a carbon/twisted problem because I currently have two production
systems each sustaining 300k datapoints per minute on comparable hardware.
One big optimization though is to send carbon lists of metrics via its
pickle listener instead of using the plain text method.
*Phew* again... I just realized this is why I don't blog, I get way too
carried away when I write :)
Anyways, I look forward to hearing your thoughts.
-Chris
On Wed, Mar 17, 2010 at 6:39 PM, Kraig Amador <kamador@xxxxxxxxxxxxx> wrote:
> Hi Chris.
>
> I've been POC'ing replacing the whisper backend with a Cassandra based
> backend. I'm currently dumping keys straight into Cassandra without any kind
> of archiving schemas applied. I whipped up a .wsp dumper to preload my
> Cassandra servers.
>
> It is FAST.
>
> Whisper backend on a DELL 2950, 4 SAS drives in RAID 5. committedPoints
> hovering around 40K
> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx (3.80)# iostat
> Linux 2.6.18-128.1.6.el5 (graphite001.sl1.shopzilla.hou) 03/17/10
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4.76 0.01 1.75 13.70 0.00 79.78
>
> Cassandra backend on 2 SGI rackable nodes, Blade servers, 2 SATA drives in
> RAID 1. Carbon and graphite are running on another one of these nodes.
> comittePoints hovering around 76K
>
> Graphite node:
> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (0.46)# iostat
> Linux 2.6.27-briullov.1 (cloudvzdev001.sl2.shopzilla.sea) 03/17/10
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.07 0.00 0.04 0.01 0.00 99.88
>
> Cassandra node 1:
> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (0.56)# iostat
> Linux 2.6.27-briullov.1 (cloudvzdev005.sl2.shopzilla.sea) 03/17/10
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.07 0.00 0.05 0.01 0.00 99.87
>
> Cassandra node 2:
> ---> root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (0.12)# iostat
> Linux 2.6.27-briullov.1 (cloudvzdev003.sl1.shopzilla.sea) 03/17/10
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.08 0.00 0.04 0.01 0.00 99.87
>
>
> The second Cassandra node was fired up just to test its clustering and
> replication. It made no performance difference.
>
> I've tried increasing my loader speed but when I hit around 100K carbon
> tends to blow up. I think this is Twisted as the cassandra nodes don't seem
> to have any problems with this kind of load.
>
>
>
> So, the issues with Cassandra.
>
>
> - It does not do hierarchical data very well. I really only replaced
> the fetch and update calls in whisper.py. I'm using the file structure to
> browse in graphite and my key names in Cassandra are the actual file paths.
> I'm undecided on the best way to solve this one. It seems like a dirty trick
> to try and force the tree data into Cassandra.
> - There will need to be a cleanup process that runs and clears out old
> cache values and consolidates old values depending on the storage schemas.
>
>
> Benefits?
>
> - It's fast
> - Data fetching could be optimized, for example if the image is 900
> pixels wide but you are fetching over a time span that has 5000 data points,
> fetch a subset of the data. The only issue here is some loss of accuracy.
> - If you need more space/performance, you can just fire up another
> Cassandra node and linearly scale your IO.
> - Data replication that is datacenter and rack aware. We have a colo in
> London that I currently hit to show our London metrics. This could be
> replicated to our LA office without having to write any code.
>
>
>
> So, any interest?
>
>
>
> Attached files:
> writer.py: removed metric sorting and context.pickle, touch the .wsp file
> so the webapp knows data exists.
> whisper.py: replaced with cassandra functions
>
> Cassandra storage conf:
> <Keyspaces>
> <Keyspace Name="Graphite">
> <ColumnFamily CompareWith="BytesType" Name="DataPoints"/>
>
> <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
> <ReplicationFactor>1</ReplicationFactor>
>
> <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
> </Keyspace>
> </Keyspaces>
>
>
Follow ups