← Back to team overview

dulwich-users team mailing list archive

Re: Tree entries

 

On Fri, 2010-10-08 at 12:51 -0700, David Borowitz wrote:
> Another place with a completely different order: Tree.add(mode, name,
> sha). I don't suppose that's one we can safely change, is it?
I guess we could check the types of mode and name and reverse them (and
print a DeprecationWarning) if they're wrong.

It's not entirely nice but I think the cleanest way of fixing it.

Cheers,

Jelmer

> On Thu, Oct 7, 2010 at 15:07, David Borowitz <dborowitz@xxxxxxxxxx>
> wrote:
>         Oh, also, another reason I'm leaning towards namedtuples at
>         this point is that they're fully compatible with the existing
>         API. With a custom class, assuming it's faster (which I
>         doubt), we'd either have to:
>         -Deprecate Tree.iteritems(), making there be *three* methods
>         that do the same thing. (Plus all the code changes to use the
>         new API.)
>         -Implement __iter__, basically turning the custom class into a
>         namedtuple anyway.
>         
>         
>         
>         On Thu, Oct 7, 2010 at 14:59, David Borowitz
>         <dborowitz@xxxxxxxxxx> wrote:
>                 I have some benchmark results for namedtuples vs.
>                 tuples. First, the microbenchmark results:
>                 $ python -m timeit -s 'from dulwich.objects import
>                 TreeEntry; name = "foo/bar"; mode = 0100644; sha = "a"
>                 * 20' 'x = TreeEntry(name, mode, sha)'
>                 1000000 loops, best of 3: 0.583 usec per loop
>                 $ python -m timeit -s 'name = "foo/bar"; mode =
>                 0100644; sha = "a" * 20' 'x = (name, mode, sha)'
>                 10000000 loops, best of 3: 0.0753 usec per loop
>                 
>                 
>                 Obviously the tuple constructor should win over
>                 TreeEntry constructor, since the latter is a wrapper
>                 around the former, and there's significant Python
>                 function call overhead. But hey, 0.5us is still pretty
>                 fast.
>                 
>                 
>                 Then I ran a much bigger macrobenchmark (attached).
>                 Basically, I cloned git.git, ran git unpack-object to
>                 explode the repo into loose files, found all the tree
>                 SHAs (without parsing the objects), then measured the
>                 time to parse all those trees and iterate all their
>                 entries. In the inner loop I also assigned all of the
>                 tuple/namedtuple values to locals to check for
>                 overhead there.
>                 
>                 
>                 heapy was used to measure the heap. My workstation is
>                 a Core 2 Quad with 8GB RAM running Ubuntu 10.04. (Note
>                 that the whole heap fit easily in main memory.)
>                 
>                 
>                 === namedtuple ===
>                 $ python tree_entry_macro.py
>                 Read 128900 objects
>                 Collected 43000 trees
>                 Iterated 42900 trees
>                 Collected 9237351 entries in 180.18s. Heap results
>                 follow.
>                 Partition of a set of 36927479 objects. Total size =
>                 2318516600 bytes.
>                  Index  Count   %     Size   % Cumulative  % Kind
>                 (class / dict of class)
>                      0 18452772  50 1273945424  55 1273945424  55 str
>                      1 9237351  25 738988080  32 2012933504  87 tuple
>                      2 9237352  25 221696448  10 2234629952  96 int
>                      3      1   0 83886152   4 2318516104 100 list
>                      4      1   0      448   0 2318516552 100
>                 types.FrameType
>                      5      2   0       48   0 2318516600 100 float
>                 
>                 
>                 === tuple ===
>                 $ python tree_entry_macro.py
>                 Read 128900 objects
>                 Collected 43000 trees
>                 Iterated 42900 trees
>                 Collected 9237351 entries in 179.49s. Heap results
>                 follow.
>                 Partition of a set of 36927479 objects. Total size =
>                 2318516600 bytes.
>                  Index  Count   %     Size   % Cumulative  % Kind
>                 (class / dict of class)
>                      0 18452772  50 1273945424  55 1273945424  55 str
>                      1 9237351  25 738988080  32 2012933504  87 tuple
>                      2 9237352  25 221696448  10 2234629952  96 int
>                      3      1   0 83886152   4 2318516104 100 list
>                      4      1   0      448   0 2318516552 100
>                 types.FrameType
>                      5      2   0       48   0 2318516600 100 float
>                 
>                 
>                 As expected, memory usage is identical, since
>                 namedtuples are just tuples with no additional
>                 per-instance memory overhead. Based on the
>                 microbenchmark results alone, the expected overhead
>                 for creating 9.2M TreeEntry objects alone is 9237351 *
>                 (0.583 - 0.0753) = 4.7s. So I'm guessing the time
>                 difference is not signifcant.
>                 
>                 
>                 On Thu, Oct 7, 2010 at 08:46, David Borowitz
>                 <dborowitz@xxxxxxxxxx> wrote:
>                         
>                         
>                         
>                         On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij
>                         <jelmer@xxxxxxxxx> wrote:
>                                 On Wed, 2010-10-06 at 14:19 -0700,
>                                 David Borowitz wrote:
>                                 > In the stuff I'm doing on rename
>                                 detection, I've been bitten a few
>                                 > times by the weird ordering of
>                                 (name, mode, sha) tuples coming out of
>                                 > various tree iteration methods.
>                                 
>                                 > Currently, we have:
>                                 > Tree.entries() yields (mode, name,
>                                 sha)
>                                 > Tree.iteritems() and
>                                 BaseObjectStore.iter_tree_contents()
>                                 yield (name,
>                                 > mode, sha)
>                                 > index.commit_tree() takes (name,
>                                 sha, mode)
>                                 
>                                 This is all for hysterical raisins,
>                                 unfortunately. I'm all for fixing
>                                 it :-)
>                                 
>                                 > We should really standardize this,
>                                 preferably in one of three ways:
>                                 > 1. Use (name, mode, sha) everywhere.
>                                 > 2. Use a namedtuple of (name, mode,
>                                 sha).
>                                 > 2. Use a new data object.
>                                 > I would prefer using a namedtuple
>                                 would make some of the code I'm
>                                 > about to write cleaner. It would
>                                 also be backwards-compatible with the
>                                 > appropriate type. The downside is
>                                 that it's another feature that's new
>                                 > in 2.6.
>                                 
>                                 It doesn't really help that you have
>                                 two options 2 ! :-)
>                                 
>                                 This is a really performance-sensitive
>                                 area (at least for me), so if we
>                                 go for anything else than a tuple we
>                                 should make sure it doesn't have a
>                                 (significant) impact on performance.
>                                 
>                                 I don't think we should drop support
>                                 for pre-2.6 Python yet; we could of
>                                 course provide a custom implementation
>                                 for those who run 2.4 or 2.5, but
>                                 that seems like more trouble than (3).
>                         
>                         
>                         Augie has told me from experience with hg that
>                         data objects even with slots have worse
>                         performance memory-wise than namedtuples. This
>                         is presumably because there's a per-object
>                         overhead for the slots, whereas namedtuples
>                         have properties that live in the class dict.
>                         I'm happy to run some benchmarks
>                         
>                         
>                         Another thing about the namedtuple function is
>                         that you can make it print out the class
>                         definition it creates, which provides an easy
>                         way to backport specific namedtuple types to
>                         previous versions.
>                          
>                                 3 seems like the best solution if it
>                                 doesn't make things too much
>                                 slower. Of course, we could give it
>                                 semantics similar to what namedtuple
>                                 would give us.
>                         
>                         
>                         I don't mind much either way. It sounds like
>                         benchmark results will be the deciding factor.
>                          
>                                 > In my ideal world we would get rid
>                                 of Tree.entries() and change
>                                 > index.commit_tree() to use the
>                                 standard format, but I don't have a
>                                 > sense of how widely either of those
>                                 are used. Thoughts on how to
>                                 > proceed?
>                                 
>                                 Changing commit_tree() is probably
>                                 possible as it's not very heavily
>                                 used. Removing Tree.entries() is not
>                                 possible I think, at least not
>                                 without deprecating it first.
>                         
>                         
>                         +1 to officially deprecating it. I thought it
>                         was kind of unofficially deprecated already.
>                         (There's two functions that are almost
>                         identical, one is implemented in terms of the
>                         other, and it's only for "historical
>                         reasons"--that's pretty much what a deprecated
>                         function looks like :)
>                          
>                                 Cheers,
>                                 
>                                 Jelmer
>                         
>                 
>                 
>         
>         
> 

Attachment: signature.asc
Description: This is a digitally signed message part


Follow ups

References