← Back to team overview

launchpad-dev team mailing list archive

capturing backtraces on every 'expensive' operation

 

So, I'd like to capture a backtrace whenever we make a SQL query or
HTTP service call.

I think this should be ok, because we don't *want* thousands of
queries or thousands of backend service calls. We want efficient batch
based operations, because doing a query or making an HTTP call is at
least 1ms itself.

The extra data this would generate would help us more quickly identify
the call paths leading to horrendous query counts.

I've done some rough profiling - see the attached file for methodology.

It looks like grabbing and stringifying a 20 frame deep backtrace on
amd64 is about 1ms: doing 1000 takes 1 second on my laptop.

So, doing this naively would add 1 second to a page that does 1000
queries; and 100ms to a page that does 100 queries.

Given that a page doing 1000 queries is almost certainly failing
*now*, I think doing this naively is probably ok.

However, we could do some things to limit the impact. I'm trying to
decide if they are worth doing.

Idea 1: Backtrace on the first 50 queries. There should be some
repetition in there, and after that stop getting backtraces.

Idea 2: Only start capturing backtraces after the first 2 seconds of
request time - small fast pages won't be penalised at all, and slower
ones will be. Optionally combine with idea 1.

Idea 3: capture 5% of queries randomly, allowing a broad overview of
behaviour across the whole request, without imposing 100ms on a 100
query page.

At this point I'm planning on just naively capturing everything: good
pages with 30 or so queries will hardly notice the overhead (in todays
world), and poor pages with > 30 queries will notice more, but already
need fixing - and the act of fixing their query count will
intrinsically fix their measurement overheads.

-Rob
#!/usr/bin/env python

import timeit

# This generates a 20 deep stackframe, and then calls a function that makes a
# stack trace, 1000 times - to simulate a deep appserver stack + oops capturing
# and formating tracebacks.

code = """import traceback
oops = []
def stackframe(depth=20):
    l = locals()
    for i in range(10):
        l[str(i)] = i
    if depth:
        stackframe(depth-1)
    else:
        for i in range(1000):
            oops.append(''.join(traceback.format_stack()))
stackframe()
"""
code2 = """import traceback
oops = []
def stackframe(depth=20):
    l = locals()
    for i in range(10):
        l[str(i)] = i
    if depth:
        stackframe(depth-1)
    else:
        for i in range(1000):
            oops.append('')
stackframe()
"""

# Time with tracebacks
t = timeit.Timer(code)
print t.repeat(number=10)
# Time without
t = timeit.Timer(code2)
print t.repeat(number=10)

Follow ups