launchpad-dev team mailing list archive

Re: performance dashboard?

 

Hi Robert,

On Fri, May 7, 2010 at 2:48 AM, Robert Collins
<robert.collins@xxxxxxxxxxxxx> wrote:
> I was at devopsdownunder last weekend and saw a demo of a very
> interesting tool. Have a look at
> http://rpm.newrelic.com/v2/accounts/12842/applications/113766 -
> ignoring the bling, its a tool for individual and aggregated
> statistics on *every single request* going through an application
> stack.
>
> Kind of what we get with oops reports (database time, python time) but
> pervasive rather than only-on-the-broken-requests.
>
> I think one of the challenges with performance work at the moment -
> and please, correct me if I'm wrong - is that individual developers
> can't easily, routinely see where things are at. Right now, when
> someone asks 'why is xxx slow', the best we can do is:
>  - add ++oops++ to the url to trigger an oops
>  - wait 3 +- 3 minutes for it to sync
>  - look it up on the oops website
>
> This has two issues:
>  - we can't see if its *usual* for that page to be slow, or if its
> unusually slow for one individual.
>  - its slow and cumbersome.
>
> For instance, if we want 100ms page generation, it would be terribly
> useful to be able to see that right now, on average, we're spending
> (say) 60ms in the database.
>
> Now, I'm not suggesting we go out and invent such a dashboard itself -
> there's going to be a tonne of investment needed to do that, but
> perhaps there is an open source version of this out there already for
> zope apps? Or perhaps we could look at providing a zope plugin to talk
> to newrelic?

I'd go as far as saying that there is nothing Zope-specific about
collecting the kind of metrics that would be interesting. The great
majority of stats would be collected at the HAProxy/Squid/Apache
level, and the remaining ones would likely come from Storm or other
subsystems like RabbitMQ for Landscape. Maybe finding out whether the
threads of a certain Zope app server are exhausted would be useful, but
that's the only thing that comes to my mind.

One thing we recently added to Landscape was a little debugging helper
that can be enabled during development, and looks like this:

  http://www.ubuntu-pics.de/bild/58528/selection_061_X2Xt9L.png

The situation is the same. We have a way to collect some metrics but
if they are not aggregated there's not much point in having them
except during development.
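
To make the aggregation point concrete, here is a minimal sketch of
what I mean. It assumes we already have per-request records of the form
(path, database ms, total ms) from somewhere; the record layout, the
aggregate() helper and the sample numbers are all made up for
illustration and are not what Landscape or Launchpad actually do.

```python
from collections import defaultdict
from statistics import mean


def aggregate(requests):
    """Group per-request timings by path and average them.

    `requests` is an iterable of (path, db_ms, total_ms) tuples.
    This record shape is an assumption made for the example.
    """
    by_path = defaultdict(list)
    for path, db_ms, total_ms in requests:
        by_path[path].append((db_ms, total_ms))

    summary = {}
    for path, samples in by_path.items():
        summary[path] = {
            "count": len(samples),
            "avg_db_ms": mean(s[0] for s in samples),
            "avg_total_ms": mean(s[1] for s in samples),
        }
    return summary


# Hypothetical sample data, not real measurements.
SAMPLE = [
    ("/bugs", 60, 100),
    ("/bugs", 80, 140),
    ("/code", 20, 50),
]

if __name__ == "__main__":
    for path, stats in sorted(aggregate(SAMPLE).items()):
        print(path, stats)
```

With something like this fed from every request, the per-page average
says whether a slow page is usually slow or just slow for one person,
which a single debugging overlay can't tell you.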

I have a lot of interest in the subject of collecting metrics and
analyzing bottlenecks. I even have a book or two around that I
recommend to people who want to dig into the subject (e.g. The Art of
Capacity Planning). However, when it comes down to actually doing it, I
feel like our developers are way too distant from the LOSAs. It might
be that I just never tried to get a new metric graphed, and I've never
seen any graph from Apache or HAProxy internally, though I trust that
they exist and someone is watching over them.

-- Sidnei


