dulwich-users team mailing list archive
-
dulwich-users team
-
Mailing list archive
-
Message #00236
Re: Tree entries
Another place with a completely different order: Tree.add(mode, name, sha).
I don't suppose that's one we can safely change, is it?
On Thu, Oct 7, 2010 at 15:07, David Borowitz <dborowitz@xxxxxxxxxx> wrote:
> Oh, also, another reason I'm leaning towards namedtuples at this point is
> that they're fully compatible with the existing API. With a custom class,
> assuming it's faster (which I doubt), we'd either have to:
> -Deprecate Tree.iteritems(), making there be *three* methods that do the
> same thing. (Plus all the code changes to use the new API.)
> -Implement __iter__, basically turning the custom class into a namedtuple
> anyway.
>
>
> On Thu, Oct 7, 2010 at 14:59, David Borowitz <dborowitz@xxxxxxxxxx> wrote:
>
>> I have some benchmark results for namedtuples vs. tuples. First, the
>> microbenchmark results:
>> $ python -m timeit -s 'from dulwich.objects import TreeEntry; name =
>> "foo/bar"; mode = 0100644; sha = "a" * 20' 'x = TreeEntry(name, mode, sha)'
>> 1000000 loops, best of 3: 0.583 usec per loop
>> $ python -m timeit -s 'name = "foo/bar"; mode = 0100644; sha = "a" * 20'
>> 'x = (name, mode, sha)'
>> 10000000 loops, best of 3: 0.0753 usec per loop
>>
>> Obviously the tuple constructor should win over TreeEntry constructor,
>> since the latter is a wrapper around the former, and there's significant
>> Python function call overhead. But hey, 0.5us is still pretty fast.
>>
>> Then I ran a much bigger macrobenchmark (attached). Basically, I cloned
>> git.git, ran git unpack-object to explode the repo into loose files, found
>> all the tree SHAs (without parsing the objects), then measured the time to
>> parse all those trees and iterate all their entries. In the inner loop I
>> also assigned all of the tuple/namedtuple values to locals to check for
>> overhead there.
>>
>> heapy was used to measure the heap. My workstation is a Core 2 Quad with
>> 8GB RAM running Ubuntu 10.04. (Note that the whole heap fit easily in main
>> memory.)
>>
>> === namedtuple ===
>> $ python tree_entry_macro.py
>> Read 128900 objects
>> Collected 43000 trees
>> Iterated 42900 trees
>> Collected 9237351 entries in 180.18s. Heap results follow.
>> Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
>> Index Count % Size % Cumulative % Kind (class / dict of class)
>> 0 18452772 50 1273945424 55 1273945424 55 str
>> 1 9237351 25 738988080 32 2012933504 87 tuple
>> 2 9237352 25 221696448 10 2234629952 96 int
>> 3 1 0 83886152 4 2318516104 100 list
>> 4 1 0 448 0 2318516552 100 types.FrameType
>> 5 2 0 48 0 2318516600 100 float
>>
>> === tuple ===
>> $ python tree_entry_macro.py
>> Read 128900 objects
>> Collected 43000 trees
>> Iterated 42900 trees
>> Collected 9237351 entries in 179.49s. Heap results follow.
>> Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
>> Index Count % Size % Cumulative % Kind (class / dict of class)
>> 0 18452772 50 1273945424 55 1273945424 55 str
>> 1 9237351 25 738988080 32 2012933504 87 tuple
>> 2 9237352 25 221696448 10 2234629952 96 int
>> 3 1 0 83886152 4 2318516104 100 list
>> 4 1 0 448 0 2318516552 100 types.FrameType
>> 5 2 0 48 0 2318516600 100 float
>>
>> As expected, memory usage is identical, since namedtuples are just tuples
>> with no additional per-instance memory overhead. Based on the microbenchmark
>> results alone, the expected overhead for creating 9.2M TreeEntry objects
>> alone is 9237351 * (0.583 - 0.0753) = 4.7s. So I'm guessing the time
>> difference is not signifcant.
>>
>> On Thu, Oct 7, 2010 at 08:46, David Borowitz <dborowitz@xxxxxxxxxx>wrote:
>>
>>>
>>>
>>> On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:
>>>
>>>> On Wed, 2010-10-06 at 14:19 -0700, David Borowitz wrote:
>>>> > In the stuff I'm doing on rename detection, I've been bitten a few
>>>> > times by the weird ordering of (name, mode, sha) tuples coming out of
>>>> > various tree iteration methods.
>>>>
>>>> > Currently, we have:
>>>> > Tree.entries() yields (mode, name, sha)
>>>> > Tree.iteritems() and BaseObjectStore.iter_tree_contents() yield (name,
>>>> > mode, sha)
>>>> > index.commit_tree() takes (name, sha, mode)
>>>> This is all for hysterical raisins, unfortunately. I'm all for fixing
>>>> it :-)
>>>>
>>>> > We should really standardize this, preferably in one of three ways:
>>>> > 1. Use (name, mode, sha) everywhere.
>>>> > 2. Use a namedtuple of (name, mode, sha).
>>>> > 2. Use a new data object.
>>>> > I would prefer using a namedtuple would make some of the code I'm
>>>> > about to write cleaner. It would also be backwards-compatible with the
>>>> > appropriate type. The downside is that it's another feature that's new
>>>> > in 2.6.
>>>> It doesn't really help that you have two options 2 ! :-)
>>>>
>>>> This is a really performance-sensitive area (at least for me), so if we
>>>> go for anything else than a tuple we should make sure it doesn't have a
>>>> (significant) impact on performance.
>>>>
>>>> I don't think we should drop support for pre-2.6 Python yet; we could of
>>>> course provide a custom implementation for those who run 2.4 or 2.5, but
>>>> that seems like more trouble than (3).
>>>>
>>>
>>> Augie has told me from experience with hg that data objects even with
>>> slots have worse performance memory-wise than namedtuples. This is
>>> presumably because there's a per-object overhead for the slots, whereas
>>> namedtuples have properties that live in the class dict. I'm happy to run
>>> some benchmarks
>>>
>>> Another thing about the namedtuple function is that you can make it print
>>> out the class definition it creates, which provides an easy way to backport
>>> specific namedtuple types to previous versions.
>>>
>>>
>>>> 3 seems like the best solution if it doesn't make things too much
>>>> slower. Of course, we could give it semantics similar to what namedtuple
>>>> would give us.
>>>
>>>
>>> I don't mind much either way. It sounds like benchmark results will be
>>> the deciding factor.
>>>
>>>
>>>> > In my ideal world we would get rid of Tree.entries() and change
>>>> > index.commit_tree() to use the standard format, but I don't have a
>>>> > sense of how widely either of those are used. Thoughts on how to
>>>> > proceed?
>>>> Changing commit_tree() is probably possible as it's not very heavily
>>>> used. Removing Tree.entries() is not possible I think, at least not
>>>> without deprecating it first.
>>>>
>>>
>>> +1 to officially deprecating it. I thought it was kind of unofficially
>>> deprecated already. (There's two functions that are almost identical, one is
>>> implemented in terms of the other, and it's only for "historical
>>> reasons"--that's pretty much what a deprecated function looks like :)
>>>
>>>
>>>> Cheers,
>>>>
>>>> Jelmer
>>>>
>>>
>>>
>>
>
Follow ups
References