openstack team mailing list archive

Re: availability/performance sensors/probes

 

If John Dickinson can steal me a 30-minute block at the conference I'll probably be giving a talk about it, but we (Rackspace) started switching to Graphite back in December. We're basically just following the Etsy cookbook to "graph all the things!".

We're using https://github.com/pandemicsyn/swift-informant to fire events to statsd. It takes care of answering questions like:

How many Object GET 200s are we currently getting per second?
How many container ops are we doing per second?
What was the average request time of container HEADs between 4-5 PM last Tuesday? (Which always seems to lead to the question of why they're so much slower today… oh look, that node is having a weird hw issue.)
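
For anyone who hasn't looked at statsd before, here is a rough sketch of what "firing an event" amounts to: a small text payload sent over UDP, with `|c` marking a counter and `|ms` a timer. The metric names and the `request_metrics` helper below are made up for illustration and are not necessarily what swift-informant emits; the host and port are assumptions (8125 is the usual statsd default).

```python
import socket


def counter(name, count=1):
    """Format a statsd counter increment, e.g. 'object.GET.200:1|c'."""
    return f"{name}:{count}|c"


def timer(name, millis):
    """Format a statsd timing sample in milliseconds."""
    return f"{name}:{millis}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget a single statsd payload over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


def request_metrics(server_type, method, status, duration_seconds):
    """Build the payloads for one finished request (hypothetical naming)."""
    prefix = f"{server_type}.{method}.{status}"
    millis = int(round(duration_seconds * 1000))
    return [counter(prefix), timer(prefix + ".time", millis)]
```

Calling `send(p)` for each payload returned by `request_metrics("object", "GET", 200, 0.0123)` would give statsd one count and one timing sample, which is enough for Graphite to answer the per-second and average-request-time questions above.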

Swift's also really good about dumping info to the error log. We convert the majority of those log lines to events that get fired to statsd using https://github.com/pandemicsyn/statsdlog.

That lets us track everything from container-replicator timeouts and auth service retries to OSErrors on the object servers (I think we're tracking about 25-30 log line patterns at the moment).
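
The log-line-to-event idea can be sketched as a list of regex patterns, each mapped to a metric name. This is only an illustration of the approach, not statsdlog's actual code or configuration; the patterns, metric names, and sample log lines are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical patterns; a real deployment would carry its own list.
PATTERNS = [
    (re.compile(r"container-replicator.*timeout", re.IGNORECASE),
     "container-replicator.timeout"),
    (re.compile(r"OSError"), "object-server.oserror"),
]


def metric_for(line):
    """Return the metric name for the first matching pattern, or None."""
    for pattern, name in PATTERNS:
        if pattern.search(line):
            return name
    return None


def count_events(lines):
    """Tally how many times each known pattern shows up in the given lines."""
    counts = Counter()
    for line in lines:
        name = metric_for(line)
        if name is not None:
            counts[name] += 1
    return counts
```

In practice each match would be sent to statsd as a counter increment as the line arrives, rather than tallied in memory as done here.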

The last piece is just a hacked version of the swift-recon cli. It's what reports async pendings, replication times, etc. to Graphite.
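
Reporting a value straight to Graphite is simple because carbon accepts a plaintext line of the form `path value timestamp`. A minimal sketch, assuming a carbon listener on the default plaintext port 2003; the metric path and the helper names are invented for illustration and do not come from the hacked swift-recon.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_lines(lines, host="127.0.0.1", port=2003):
    """Send already-formatted lines to carbon over a single TCP connection."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

A recon-style reporter would collect something like the async pending count per node, format it with `graphite_line`, and pass the batch to `send_lines` on each run.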

Right now it gets tied together by a tiny hackish Flask app that generates some TV dashboards and will probably start doing the monitoring/alerting for the traffic prediction/confidence bands (we're also experimenting with just doing it with an IRC bot).

--  
Florian Hines | @pandemicsyn
http://about.me/pandemicsyn


On Wednesday, February 22, 2012 at 2:50 AM, Jasper Capel wrote:

> I've uploaded the checks we use in production here at Spil Games to https://github.com/spilgames/swift. Besides check_swift (which is a functional test), everything's meant to gather statistics from the cluster, and we're looking to replace that with a Graphite-based solution to avoid having to parse access logs and to have more real-time metrics available. Nothing fancy, but it may be of use to someone.
>  
> Jasper
>  
>  
>  
> On Feb 21, 2012, at 11:54 PM, Tim Bell wrote:
>  
> >  
> > This does bring up a more generic problem of sharing the
> > availability/performance code for all of the OpenStack components.
> >  
> > At the design summit, this was proposed as one of the example use cases of
> > the OpenStack community forge (I forget the exact name) but it was intended
> > as a place for sharing code/procedures which were not intended to be part of
> > the core but may be of interest to others.
> >  
> > Was anything set up along these lines?
> >  
> > A set of production quality Nagios/Ganglia sensors would be very interesting
> > if someone has these....
> >  
> > Tim
> >  
> > > -----Original Message-----
> > > > From: openstack-bounces+tim.bell=cern.ch@xxxxxxxxxxxxxxxxxxx
> > > > [mailto:openstack-bounces+tim.bell=cern.ch@xxxxxxxxxxxxxxxxxxx] On Behalf
> > > > Of Jasper Capel
> > > Sent: 21 February 2012 18:29
> > > To: John Dickinson
> > > Cc: openstack@xxxxxxxxxxxxxxxxxxx (mailto:openstack@xxxxxxxxxxxxxxxxxxx)
> > > Subject: Re: [Openstack] swprobe: swift middleware for sending metrics to
> > > graphite using statsd
> > >  
> > > Hi John,
> > >  
> > > > Apparently my google-fu is not up to snuff, as I wasn't aware of that
> > > > project. Had I been, I probably would've just extended that one. :)
> > >  
> > > Cheers,
> > > Jasper
> > >  
> > > ________________________________________
> > > > From: John Dickinson [me@xxxxxx]
> > > Sent: Tuesday, February 21, 2012 5:44 PM
> > > To: Jasper Capel
> > > Cc: openstack@xxxxxxxxxxxxxxxxxxx (mailto:openstack@xxxxxxxxxxxxxxxxxxx)
> > > Subject: Re: [Openstack] swprobe: swift middleware for sending metrics to
> > > graphite using statsd
> > >  
> > > That's great. Have you by any chance seen
> > > https://github.com/pandemicsyn/swift-informant? It's something similar
> > > that we've been playing with at Rackspace.
> > >  
> > > --John
> > >  
> > >  
> > > On Feb 21, 2012, at 10:36 AM, Jasper Capel wrote:
> > >  
> > > > Hi all,
> > > >  
> > > > > I'm announcing a piece of Swift middleware, swprobe [1], designed to
> > > > > gather run-time metrics and ship them off to Graphite [2] for near
> > > > > real-time monitoring. Currently it sends out bytes uploaded and
> > > > > downloaded per account, HTTP methods and response codes, and timings
> > > > > in milliseconds on each call.
> > > >  
> > > > > To be able to use this you need Graphite [2]. You also need statsd [3]
> > > > > running, preferably on the local machine, since many small UDP
> > > > > packets are potentially being sent out. Please also note that we have
> > > > > not yet tested this with production workloads.
> > > >  
> > > > [1] - https://github.com/spilgames/swprobe
> > > > [2] - http://graphite.wikidot.com/
> > > > [3] - https://github.com/etsy/statsd
> > > >  
> > > > Best regards,
> > > >  
> > > > --
> > > > Jasper Capel
> > > > Lead Infrastructure Engineer
> > > >  
> > > > W http://www.spilgames.com | S jwcapel-spil
> > > >  
> > > >  
> > > >  
> > > > _______________________________________________
> > > > Mailing list: https://launchpad.net/~openstack
> > > > > Post to : openstack@xxxxxxxxxxxxxxxxxxx
> > > > Unsubscribe : https://launchpad.net/~openstack
> > > > More help : https://help.launchpad.net/ListHelp
> > > >  
> > >  
> > >  
> > >  
> > >  
> >  
> >  
>  
>  
>  
>  
>  