dulwich-users team mailing list archive

Thread
Date
Re: Tree entries

To: Jelmer Vernooij <jelmer@xxxxxxxxx>
From: David Borowitz <dborowitz@xxxxxxxxxx>
Date: Thu, 7 Oct 2010 15:07:28 -0700
Cc: dulwich-users <dulwich-users@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <AANLkTi=97ms+irXJ9m1vds-to01H=XP8AEBse-8E3+g8@mail.gmail.com>
Oh, also, another reason I'm leaning towards namedtuples at this point is
that they're fully compatible with the existing API. With a custom class,
assuming it's faster (which I doubt), we'd either have to:
-Deprecate Tree.iteritems(), making there be *three* methods that do the
same thing. (Plus all the code changes to use the new API.)
-Implement __iter__, basically turning the custom class into a namedtuple
anyway.

On Thu, Oct 7, 2010 at 14:59, David Borowitz <dborowitz@xxxxxxxxxx> wrote:

> I have some benchmark results for namedtuples vs. tuples. First, the
> microbenchmark results:
> $ python -m timeit -s 'from dulwich.objects import TreeEntry; name =
> "foo/bar"; mode = 0100644; sha = "a" * 20' 'x = TreeEntry(name, mode, sha)'
> 1000000 loops, best of 3: 0.583 usec per loop
> $ python -m timeit -s 'name = "foo/bar"; mode = 0100644; sha = "a" * 20' 'x
> = (name, mode, sha)'
> 10000000 loops, best of 3: 0.0753 usec per loop
>
> Obviously the tuple constructor should win over TreeEntry constructor,
> since the latter is a wrapper around the former, and there's significant
> Python function call overhead. But hey, 0.5us is still pretty fast.
>
> Then I ran a much bigger macrobenchmark (attached). Basically, I cloned
> git.git, ran git unpack-object to explode the repo into loose files, found
> all the tree SHAs (without parsing the objects), then measured the time to
> parse all those trees and iterate all their entries. In the inner loop I
> also assigned all of the tuple/namedtuple values to locals to check for
> overhead there.
>
> heapy was used to measure the heap. My workstation is a Core 2 Quad with
> 8GB RAM running Ubuntu 10.04. (Note that the whole heap fit easily in main
> memory.)
>
> === namedtuple ===
> $ python tree_entry_macro.py
> Read 128900 objects
> Collected 43000 trees
> Iterated 42900 trees
> Collected 9237351 entries in 180.18s. Heap results follow.
> Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
>  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0 18452772  50 1273945424  55 1273945424  55 str
>      1 9237351  25 738988080  32 2012933504  87 tuple
>      2 9237352  25 221696448  10 2234629952  96 int
>      3      1   0 83886152   4 2318516104 100 list
>      4      1   0      448   0 2318516552 100 types.FrameType
>      5      2   0       48   0 2318516600 100 float
>
> === tuple ===
> $ python tree_entry_macro.py
> Read 128900 objects
> Collected 43000 trees
> Iterated 42900 trees
> Collected 9237351 entries in 179.49s. Heap results follow.
> Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
>  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0 18452772  50 1273945424  55 1273945424  55 str
>      1 9237351  25 738988080  32 2012933504  87 tuple
>      2 9237352  25 221696448  10 2234629952  96 int
>      3      1   0 83886152   4 2318516104 100 list
>      4      1   0      448   0 2318516552 100 types.FrameType
>      5      2   0       48   0 2318516600 100 float
>
> As expected, memory usage is identical, since namedtuples are just tuples
> with no additional per-instance memory overhead. Based on the microbenchmark
> results alone, the expected overhead for creating 9.2M TreeEntry objects
> alone is 9237351 * (0.583 - 0.0753) = 4.7s. So I'm guessing the time
> difference is not signifcant.
>
> On Thu, Oct 7, 2010 at 08:46, David Borowitz <dborowitz@xxxxxxxxxx> wrote:
>
>>
>>
>> On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:
>>
>>> On Wed, 2010-10-06 at 14:19 -0700, David Borowitz wrote:
>>> > In the stuff I'm doing on rename detection, I've been bitten a few
>>> > times by the weird ordering of (name, mode, sha) tuples coming out of
>>> > various tree iteration methods.
>>>
>>> > Currently, we have:
>>> > Tree.entries() yields (mode, name, sha)
>>> > Tree.iteritems() and BaseObjectStore.iter_tree_contents() yield (name,
>>> > mode, sha)
>>> > index.commit_tree() takes (name, sha, mode)
>>> This is all for hysterical raisins, unfortunately. I'm all for fixing
>>> it :-)
>>>
>>> > We should really standardize this, preferably in one of three ways:
>>> > 1. Use (name, mode, sha) everywhere.
>>> > 2. Use a namedtuple of (name, mode, sha).
>>> > 2. Use a new data object.
>>> > I would prefer using a namedtuple would make some of the code I'm
>>> > about to write cleaner. It would also be backwards-compatible with the
>>> > appropriate type. The downside is that it's another feature that's new
>>> > in 2.6.
>>> It doesn't really help that you have two options 2 ! :-)
>>>
>>> This is a really performance-sensitive area (at least for me), so if we
>>> go for anything else than a tuple we should make sure it doesn't have a
>>> (significant) impact on performance.
>>>
>>> I don't think we should drop support for pre-2.6 Python yet; we could of
>>> course provide a custom implementation for those who run 2.4 or 2.5, but
>>> that seems like more trouble than (3).
>>>
>>
>> Augie has told me from experience with hg that data objects even with
>> slots have worse performance memory-wise than namedtuples. This is
>> presumably because there's a per-object overhead for the slots, whereas
>> namedtuples have properties that live in the class dict. I'm happy to run
>> some benchmarks
>>
>> Another thing about the namedtuple function is that you can make it print
>> out the class definition it creates, which provides an easy way to backport
>> specific namedtuple types to previous versions.
>>
>>
>>> 3 seems like the best solution if it doesn't make things too much
>>> slower. Of course, we could give it semantics similar to what namedtuple
>>> would give us.
>>
>>
>> I don't mind much either way. It sounds like benchmark results will be the
>> deciding factor.
>>
>>
>>>  > In my ideal world we would get rid of Tree.entries() and change
>>> > index.commit_tree() to use the standard format, but I don't have a
>>> > sense of how widely either of those are used. Thoughts on how to
>>> > proceed?
>>> Changing commit_tree() is probably possible as it's not very heavily
>>> used. Removing Tree.entries() is not possible I think, at least not
>>> without deprecating it first.
>>>
>>
>> +1 to officially deprecating it. I thought it was kind of unofficially
>> deprecated already. (There's two functions that are almost identical, one is
>> implemented in terms of the other, and it's only for "historical
>> reasons"--that's pretty much what a deprecated function looks like :)
>>
>>
>>> Cheers,
>>>
>>> Jelmer
>>>
>>
>>
>
Follow ups

Re: Tree entries
From: David Borowitz, 2010-10-08
References

Tree entries
From: David Borowitz, 2010-10-06
Re: Tree entries
From: Jelmer Vernooij, 2010-10-07
Re: Tree entries
From: David Borowitz, 2010-10-07
Re: Tree entries
From: David Borowitz, 2010-10-07