launchpad-dev team mailing list archive

Re: Etsy graphing using StatsD and Graphite

On 23 June 2011 18:44, Martin Pool <mbp@xxxxxxxxxxxxx> wrote:
> On 7 May 2011 11:52, Jonathan Lange <jml@xxxxxxxxxxxxx> wrote:
>> A while ago, Elliot pointed me at this interesting story from Etsy:
>> <http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/>.
>
> I had a go at implementing this in Launchpad at the datascience
> sprint, and did get graphite drawing graphs of the rate at which
> events happened, which was quite nice.  I am quite excited by the idea
> of getting a live view of the rate of say codehost connections, failed
> connections, or the speed at which they're serviced, or the rate of
> oopses in the webapp.
>
> The application interface is basically just to add lines like this:
>
>         send_event('webapp.publication.exception')
>
> this sends a UDP packet to txstatsd, which aggregates them and sends
> the result to graphite, and you get
> <http://www.flickr.com/photos/mbp_/5864784948/in/photostream>.
>
> Robert has decided this should go into Tuolumne/lpstats for now,
> rather than graphite, to be consistent with other metrics.  Tuolumne
> has some shortcomings compared to graphite, but we could switch later.
> There is an ongoing IS project to select and switch to a different
> graphing/data system.
> <https://wiki.canonical.com/InformationInfrastructure/IS/TrendingSolutions>
>
> Robert also points out
> <https://dev.launchpad.net/ArchitectureGuide/ServicesRequirements>.
>
> The idea is to run a single statsd on carob, which is also where
> lpstats runs, so it can write directly into that, probably by piping
> stuff in to its data importer.
>
> I'm inclined to use Sidnei's recently released txstatsd because it
> shouldn't add any new dependencies and it will be easy to tweak to
> write to tuolumne, and to then put it in a deb.
>
> statsd has the nice design attribute that information is sent in to it
> over UDP, so it should be fairly hard for any problem in monitoring to
> slow down or raise errors in Launchpad itself.  The down side is that
> if packets are lost, they're just lost.  One way to mitigate this
> would be to deploy one statsd per machine (they shouldn't be lost over
> loopback unless the statsd is broken) and then use more reliable TCP
> to talk to the eventual destination.  That's probably better done
> after a switch to graphite (or whatever.)  Another nice thing about
> this is that statsd can somewhat abstract the eventual destination
> from all the various things that talk to it.
>
> So the steps from here seem to be:
>
>  - take the statsd address from lp-production-configs/production.conf
> rather than assuming it's local
>  - add tests for the statsd client code in Launchpad
>  - get this merged
>  - perhaps add some means to test that events are emitted (I don't
> know if adding tests for debug code that will be obvious if it breaks
> really has much value)
>  - with sysadmins, teach nagios how to monitor that statsd is up
>  - teach txstatsd to emit stats to tuolumne
>  - package txstatsd
>  - add it to the lp dependencies
>  - get the internal firewalls changed so all servers can send UDP to carob
>  - document how this ought to be used in the architecture guide,
> including when to use this vs logging
>  - scatter operational event notifications through the code where they
> seem useful
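
For readers who have not seen a statsd client, the send_event() call
quoted above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the code in the branch:
format_counter(), DEFAULT_STATSD_ADDR and the addr/sock parameters are
hypothetical names, and the address would really come from
production.conf. The "name:count|c" datagram is the usual statsd
counter format, and 8125 is the port statsd conventionally listens on.

```python
import socket

# Hypothetical default; in Launchpad this would be read from
# lp-production-configs/production.conf rather than hard-coded.
DEFAULT_STATSD_ADDR = ("127.0.0.1", 8125)


def format_counter(name, count=1):
    """Build a statsd counter datagram such as b'a.b:1|c'."""
    return ("%s:%d|c" % (name, count)).encode("ascii")


def send_event(name, count=1, addr=DEFAULT_STATSD_ADDR, sock=None):
    """Fire-and-forget: send one counter increment over UDP.

    Socket and encoding errors are swallowed so that a monitoring
    problem cannot raise in the caller.  Returns True if the datagram
    was handed to the OS, False otherwise.
    """
    own_socket = sock is None
    try:
        if own_socket:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(format_counter(name, count), addr)
        return True
    except (OSError, UnicodeError):
        return False
    finally:
        # Only close sockets we created ourselves.
        if own_socket and sock is not None:
            sock.close()
```

Because nothing is ever read back, the caller does not wait on the
statsd server at all. Testing that events are emitted can be done by
binding a throwaway UDP socket on loopback, passing its address as
addr, and checking the datagram that arrives.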

I've put this into
<https://code.launchpad.net/~mbp/launchpad/event-stats>.  I think I'm
going to shelve it until the picture of which stats back end will be
used is clearer, and hopefully until it moves to something more
powerful than tuolumne.
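
Whichever back end is chosen, the aggregation step in the middle stays
about the same size. As a rough sketch only (this is not txstatsd's
code, and CounterAggregator and its methods are hypothetical names),
the server sums counter datagrams and periodically turns them into
per-second rates. The output here uses Graphite's plaintext line
format, "path value timestamp"; writing to tuolumne instead would mean
swapping out flush(). Real statsd also namespaces its output and
handles timers and sample rates, all of which this ignores.

```python
import time
from collections import Counter


class CounterAggregator:
    """Sum statsd counter datagrams and flush them as per-second rates."""

    def __init__(self, flush_interval=10.0):
        self.flush_interval = flush_interval  # seconds between flushes
        self.counts = Counter()

    def handle_datagram(self, data):
        """Parse b'name:value|c'; silently drop anything else."""
        try:
            name, rest = data.decode("ascii").split(":", 1)
            value, kind = rest.split("|", 1)
            if kind != "c":
                return  # timers, gauges, sampled counters: not handled
            self.counts[name] += int(value)
        except (UnicodeError, ValueError):
            pass  # malformed packet; monitoring input must never raise

    def flush(self, now=None):
        """Return Graphite plaintext lines and reset the counters."""
        timestamp = int(time.time() if now is None else now)
        lines = [
            "%s %g %d" % (name, total / self.flush_interval, timestamp)
            for name, total in sorted(self.counts.items())
        ]
        self.counts.clear()
        return lines
```

A UDP receive loop would call handle_datagram() for each packet and a
timer would call flush() every flush_interval seconds, sending the
lines on to the destination over TCP.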

Martin

