dulwich-users team mailing list archive

Re: Tree entries


I have some benchmark results for namedtuples vs. tuples. First, the
microbenchmark results:
$ python -m timeit -s 'from dulwich.objects import TreeEntry; name =
"foo/bar"; mode = 0100644; sha = "a" * 20' 'x = TreeEntry(name, mode, sha)'
1000000 loops, best of 3: 0.583 usec per loop
$ python -m timeit -s 'name = "foo/bar"; mode = 0100644; sha = "a" * 20' 'x
= (name, mode, sha)'
10000000 loops, best of 3: 0.0753 usec per loop

Obviously the tuple constructor should win over the TreeEntry constructor,
since the latter is a wrapper around the former and adds significant Python
function call overhead. But hey, 0.5us is still pretty fast.
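If you want to reproduce the microbenchmark without a dulwich checkout, here is a self-contained sketch using the timeit module as a library. It assumes TreeEntry is a plain namedtuple (as it is in dulwich.objects); the 0o octal literal and bytes sha are modern-Python adjustments, and absolute timings will of course vary by machine:

```python
import timeit
from collections import namedtuple

# Assumption: dulwich's TreeEntry is (or behaves like) this namedtuple.
TreeEntry = namedtuple('TreeEntry', ['name', 'mode', 'sha'])

name, mode, sha = 'foo/bar', 0o100644, b'a' * 20

# Time 100k constructions of each; the namedtuple goes through an extra
# Python-level __new__ call, so it should come out several times slower.
nt_time = timeit.timeit(lambda: TreeEntry(name, mode, sha), number=100000)
tup_time = timeit.timeit(lambda: (name, mode, sha), number=100000)

print('namedtuple: %.4fs / 100k' % nt_time)
print('tuple:      %.4fs / 100k' % tup_time)
```

Since a namedtuple is a tuple subclass, the resulting object still compares equal to (and unpacks like) the bare tuple.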

Then I ran a much bigger macrobenchmark (attached). Basically, I cloned
git.git, ran git unpack-objects to explode the repo into loose objects, found
all the tree SHAs (without parsing the objects), then measured the time to
parse all those trees and iterate all their entries. In the inner loop I
also assigned all of the tuple/namedtuple values to locals to check for
overhead there.

heapy was used to measure the heap. My workstation is a Core 2 Quad with 8GB
RAM running Ubuntu 10.04. (Note that the whole heap fit easily in main
memory.)

=== namedtuple ===
$ python tree_entry_macro.py
Read 128900 objects
Collected 43000 trees
Iterated 42900 trees
Collected 9237351 entries in 180.18s. Heap results follow.
Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 18452772  50 1273945424  55 1273945424  55 str
     1 9237351  25 738988080  32 2012933504  87 tuple
     2 9237352  25 221696448  10 2234629952  96 int
     3      1   0 83886152   4 2318516104 100 list
     4      1   0      448   0 2318516552 100 types.FrameType
     5      2   0       48   0 2318516600 100 float

=== tuple ===
$ python tree_entry_macro.py
Read 128900 objects
Collected 43000 trees
Iterated 42900 trees
Collected 9237351 entries in 179.49s. Heap results follow.
Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 18452772  50 1273945424  55 1273945424  55 str
     1 9237351  25 738988080  32 2012933504  87 tuple
     2 9237352  25 221696448  10 2234629952  96 int
     3      1   0 83886152   4 2318516104 100 list
     4      1   0      448   0 2318516552 100 types.FrameType
     5      2   0       48   0 2318516600 100 float

As expected, memory usage is identical, since namedtuples are just tuples
with no additional per-instance memory overhead. Based on the microbenchmark
results, the expected overhead for creating 9.2M TreeEntry objects is
9237351 * (0.583 - 0.0753) us ~= 4.7s. So I'm guessing the time difference
is not significant.
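Both claims are easy to sanity-check: the 4.7s figure is just the per-entry constructor delta times the entry count, and a namedtuple instance has exactly the same in-memory size as the equivalent plain tuple (a quick sketch using sys.getsizeof; the namedtuple definition mirrors TreeEntry):

```python
import sys
from collections import namedtuple

# Expected constructor overhead: 9237351 entries times the per-entry
# delta from the microbenchmark, converted from microseconds to seconds.
overhead_s = 9237351 * (0.583 - 0.0753) * 1e-6
print('expected overhead: %.1fs' % overhead_s)  # -> expected overhead: 4.7s

# A namedtuple declares __slots__ = () and stores its fields in the
# underlying tuple, so per-instance size is identical to a plain tuple.
TreeEntry = namedtuple('TreeEntry', ['name', 'mode', 'sha'])
entry = TreeEntry('foo/bar', 0o100644, 'a' * 20)
print(sys.getsizeof(entry) == sys.getsizeof(tuple(entry)))  # -> True
```

The field accessors live on the class as descriptors, not on the instances, which is why the heapy partitions above are byte-for-byte identical.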

On Thu, Oct 7, 2010 at 08:46, David Borowitz <dborowitz@xxxxxxxxxx> wrote:

>
>
> On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:
>
>> On Wed, 2010-10-06 at 14:19 -0700, David Borowitz wrote:
>> > In the stuff I'm doing on rename detection, I've been bitten a few
>> > times by the weird ordering of (name, mode, sha) tuples coming out of
>> > various tree iteration methods.
>>
>> > Currently, we have:
>> > Tree.entries() yields (mode, name, sha)
>> > Tree.iteritems() and BaseObjectStore.iter_tree_contents() yield (name,
>> > mode, sha)
>> > index.commit_tree() takes (name, sha, mode)
>> This is all for hysterical raisins, unfortunately. I'm all for fixing
>> it :-)
>>
>> > We should really standardize this, preferably in one of three ways:
>> > 1. Use (name, mode, sha) everywhere.
>> > 2. Use a namedtuple of (name, mode, sha).
>> > 2. Use a new data object.
>> > I would prefer a namedtuple; it would make some of the code I'm about
>> > to write cleaner. It would also be backwards-compatible with the
>> > appropriate type. The downside is that it's another feature that's new
>> > in 2.6.
>> It doesn't really help that you have two options 2 ! :-)
>>
>> This is a really performance-sensitive area (at least for me), so if we
>> go for anything else than a tuple we should make sure it doesn't have a
>> (significant) impact on performance.
>>
>> I don't think we should drop support for pre-2.6 Python yet; we could of
>> course provide a custom implementation for those who run 2.4 or 2.5, but
>> that seems like more trouble than (3).
>>
>
> Augie has told me from experience with hg that data objects even with slots
> have worse performance memory-wise than namedtuples. This is presumably
> because there's a per-object overhead for the slots, whereas namedtuples
> have properties that live in the class dict. I'm happy to run some
> benchmarks
>
> Another thing about the namedtuple function is that you can make it print
> out the class definition it creates, which provides an easy way to backport
> specific namedtuple types to previous versions.
>
>
>> 3 seems like the best solution if it doesn't make things too much
>> slower. Of course, we could give it semantics similar to what namedtuple
>> would give us.
>
>
> I don't mind much either way. It sounds like benchmark results will be the
> deciding factor.
>
>
>>  > In my ideal world we would get rid of Tree.entries() and change
>> > index.commit_tree() to use the standard format, but I don't have a
>> > sense of how widely either of those are used. Thoughts on how to
>> > proceed?
>> Changing commit_tree() is probably possible as it's not very heavily
>> used. Removing Tree.entries() is not possible I think, at least not
>> without deprecating it first.
>>
>
> +1 to officially deprecating it. I thought it was kind of unofficially
> deprecated already. (There are two functions that are almost identical, one
> is implemented in terms of the other, and it's only for "historical
> reasons"--that's pretty much what a deprecated function looks like. :)
>
>
>> Cheers,
>>
>> Jelmer
>>
>
>
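For completeness, here is a minimal sketch of what option 3 (a dedicated data object) might look like. The __slots__ and __iter__ choices are my own assumptions, made so that tuple-style unpacking keeps working; note that unlike a namedtuple, instances are not tuples, so existing type checks and indexing would need updating:

```python
class TreeEntry(object):
    """Hypothetical data-object version of a tree entry (option 3)."""
    __slots__ = ('name', 'mode', 'sha')

    def __init__(self, name, mode, sha):
        self.name = name
        self.mode = mode
        self.sha = sha

    def __iter__(self):
        # Keep tuple-style unpacking working: name, mode, sha = entry
        yield self.name
        yield self.mode
        yield self.sha

    def __eq__(self, other):
        # Compare equal to tuples and namedtuples with the same values.
        return tuple(self) == tuple(other)


entry = TreeEntry('foo/bar', 0o100644, 'a' * 20)
name, mode, sha = entry
print(name, oct(mode))
```

Whether the per-instance slot storage beats the tuple layout memory-wise is exactly what the benchmark above was meant to settle; it varies by interpreter and version, so measure before committing to it.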
#!/usr/bin/python
#
# Copyright 2010 Google Inc. All Rights Reserved.

PATH = '/home/dborowitz/d/git.loose'

import os
import sys
import time

from dulwich import objects
from dulwich import repo
import guppy


def main():
    hp = guppy.hpy()
    r = repo.Repo(PATH)
    tree_shas = []
    for i, sha in enumerate(r.object_store):
        if not i % 1000:
            sys.stdout.write('\rRead %i objects' % i)
            sys.stdout.flush()
        with open(os.path.join(r.object_store.path, sha[:2], sha[2:])) as f:
            # Parse the file header but not the contents.
            obj = objects.ShaFile._parse_legacy_object_header('', f)
            if isinstance(obj, objects.Tree):
                tree_shas.append(sha)
    sys.stdout.write('\nCollected %i trees\n' % len(tree_shas))

    hp.setrelheap()

    entries = []
    start_time = time.time()
    for i, tree_sha in enumerate(tree_shas):
        if not i % 100:
            sys.stdout.write('\rParsed %i trees' % i)
            sys.stdout.flush()
        for entry in r.object_store[tree_sha].iteritems():
            entries.append(entry)
            if type(entry) == tuple:
                name, mode, sha = entry
            else:
                name = entry.name
                mode = entry.mode
                sha = entry.sha
    end_time = time.time()

    del tree_shas

    sys.stdout.write('\nCollected %i entries in %.2fs. Heap results follow.\n' %
                     (len(entries), end_time - start_time))
    print hp.heap()


if __name__ == '__main__':
    main()
