
dulwich-users team mailing list archive

Re: Tree entries

 

On Sat, Oct 16, 2010 at 14:59, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:

> On Fri, 2010-10-08 at 12:51 -0700, David Borowitz wrote:
> > Another place with a completely different order: Tree.add(mode, name,
> > sha). I don't suppose that's one we can safely change, is it?
> I guess we could check the types of mode and name and reverse them (and
> print a DeprecationWarning) if they're wrong.
>
> It's not entirely nice but I think the cleanest way of fixing it.
>

Yeah. I don't have a sense of whether this is a thing other Python projects
do...

I'm happy to make this change but it's not on my critical path at the
moment.
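[Editor's sketch of the approach discussed above: a hypothetical, simplified Tree, not dulwich's real dulwich.objects.Tree. The int-vs-string detection heuristic is an assumption about how the swap could be implemented.]

```python
import warnings


class Tree:
    """Simplified stand-in for dulwich's Tree, for illustration only."""

    def __init__(self):
        self._entries = {}

    def add(self, name, mode, hexsha):
        """Add an entry, tolerating the legacy (mode, name, sha) order.

        If the first argument looks like a mode (an int) and the second
        like a name (a string), assume the caller used the old order,
        emit a DeprecationWarning, and swap them.
        """
        if isinstance(name, int) and isinstance(mode, str):
            warnings.warn(
                "Tree.add(mode, name, sha) is deprecated; "
                "use Tree.add(name, mode, sha)",
                DeprecationWarning, stacklevel=2)
            name, mode = mode, name
        self._entries[name] = (mode, hexsha)

    def entry(self, name):
        """Return the (mode, sha) pair stored under name."""
        return self._entries[name]
```

With this in place both call orders store the same entry, so old callers keep working while the warning nudges them toward the new signature.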


> Cheers,
>
> Jelmer
>
> > On Thu, Oct 7, 2010 at 15:07, David Borowitz <dborowitz@xxxxxxxxxx>
> > wrote:
> >         Oh, also, another reason I'm leaning towards namedtuples at
> >         this point is that they're fully compatible with the existing
> >         API. With a custom class, assuming it's faster (which I
> >         doubt), we'd either have to:
> >         -Deprecate Tree.iteritems(), making there be *three* methods
> >         that do the same thing. (Plus all the code changes to use the
> >         new API.)
> >         -Implement __iter__, basically turning the custom class into a
> >         namedtuple anyway.
> >
> >
> >
> >         On Thu, Oct 7, 2010 at 14:59, David Borowitz
> >         <dborowitz@xxxxxxxxxx> wrote:
> >                 I have some benchmark results for namedtuples vs.
> >                 tuples. First, the microbenchmark results:
> >                 $ python -m timeit -s 'from dulwich.objects import
> >                 TreeEntry; name = "foo/bar"; mode = 0100644; sha = "a"
> >                 * 20' 'x = TreeEntry(name, mode, sha)'
> >                 1000000 loops, best of 3: 0.583 usec per loop
> >                 $ python -m timeit -s 'name = "foo/bar"; mode =
> >                 0100644; sha = "a" * 20' 'x = (name, mode, sha)'
> >                 10000000 loops, best of 3: 0.0753 usec per loop
> >
> >
> >                 Obviously the tuple constructor should win over
> >                 TreeEntry constructor, since the latter is a wrapper
> >                 around the former, and there's significant Python
> >                 function call overhead. But hey, 0.5us is still pretty
> >                 fast.
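[Editor's note: the TreeEntry benchmarked above takes only a couple of lines to define. The field names here follow the (name, mode, sha) order discussed in the thread; the exact names are otherwise an assumption.]

```python
from collections import namedtuple

# A namedtuple is a plain tuple subclass with named fields, so it stays
# fully compatible with code that indexes or unpacks (name, mode, sha).
TreeEntry = namedtuple("TreeEntry", ["name", "mode", "sha"])

entry = TreeEntry("foo/bar", 0o100644, "a" * 20)

# Both access styles work on the same object:
name, mode, sha = entry      # tuple unpacking, as with the old API
assert entry.name == name    # attribute access, new with the namedtuple
```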
> >
> >
> >                 Then I ran a much bigger macrobenchmark (attached).
> >                 Basically, I cloned git.git, ran git unpack-object to
> >                 explode the repo into loose files, found all the tree
> >                 SHAs (without parsing the objects), then measured the
> >                 time to parse all those trees and iterate all their
> >                 entries. In the inner loop I also assigned all of the
> >                 tuple/namedtuple values to locals to check for
> >                 overhead there.
> >
> >
> >                 heapy was used to measure the heap. My workstation is
> >                 a Core 2 Quad with 8GB RAM running Ubuntu 10.04. (Note
> >                 that the whole heap fit easily in main memory.)
> >
> >
> >                 === namedtuple ===
> >                 $ python tree_entry_macro.py
> >                 Read 128900 objects
> >                 Collected 43000 trees
> >                 Iterated 42900 trees
> >                 Collected 9237351 entries in 180.18s. Heap results
> >                 follow.
> >                 Partition of a set of 36927479 objects. Total size =
> >                 2318516600 bytes.
> >                  Index  Count   %     Size   % Cumulative  % Kind
> >                 (class / dict of class)
> >                      0 18452772  50 1273945424  55 1273945424  55 str
> >                      1 9237351  25 738988080  32 2012933504  87 tuple
> >                      2 9237352  25 221696448  10 2234629952  96 int
> >                      3      1   0 83886152   4 2318516104 100 list
> >                      4      1   0      448   0 2318516552 100 types.FrameType
> >                      5      2   0       48   0 2318516600 100 float
> >
> >
> >                 === tuple ===
> >                 $ python tree_entry_macro.py
> >                 Read 128900 objects
> >                 Collected 43000 trees
> >                 Iterated 42900 trees
> >                 Collected 9237351 entries in 179.49s. Heap results
> >                 follow.
> >                 Partition of a set of 36927479 objects. Total size =
> >                 2318516600 bytes.
> >                  Index  Count   %     Size   % Cumulative  % Kind
> >                 (class / dict of class)
> >                      0 18452772  50 1273945424  55 1273945424  55 str
> >                      1 9237351  25 738988080  32 2012933504  87 tuple
> >                      2 9237352  25 221696448  10 2234629952  96 int
> >                      3      1   0 83886152   4 2318516104 100 list
> >                      4      1   0      448   0 2318516552 100 types.FrameType
> >                      5      2   0       48   0 2318516600 100 float
> >
> >
> >                 As expected, memory usage is identical, since
> >                 namedtuples are just tuples with no additional
> >                 per-instance memory overhead. Based on the
> >                 microbenchmark results alone, the expected overhead
> >                 for creating 9.2M TreeEntry objects alone is 9237351 *
> >                 (0.583 - 0.0753) = 4.7s. So I'm guessing the time
> >                 difference is not significant.
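[Editor's check: the 4.7s estimate follows directly from the two timeit numbers quoted earlier in the thread.]

```python
# Back-of-the-envelope estimate: per-entry constructor overhead (TreeEntry
# minus plain tuple, in microseconds) times the number of entries collected
# in the macrobenchmark.
entries = 9237351
overhead_usec = 0.583 - 0.0753
total_sec = entries * overhead_usec / 1e6
print(round(total_sec, 1))  # ~4.7 seconds
```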
> >
> >
> >                 On Thu, Oct 7, 2010 at 08:46, David Borowitz
> >                 <dborowitz@xxxxxxxxxx> wrote:
> >
> >
> >
> >                         On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij
> >                         <jelmer@xxxxxxxxx> wrote:
> >                                 On Wed, 2010-10-06 at 14:19 -0700,
> >                                 David Borowitz wrote:
> >                                 > In the stuff I'm doing on rename
> >                                 detection, I've been bitten a few
> >                                 > times by the weird ordering of
> >                                 (name, mode, sha) tuples coming out of
> >                                 > various tree iteration methods.
> >
> >                                 > Currently, we have:
> >                                 > Tree.entries() yields (mode, name,
> >                                 sha)
> >                                 > Tree.iteritems() and
> >                                 BaseObjectStore.iter_tree_contents()
> >                                 yield (name,
> >                                 > mode, sha)
> >                                 > index.commit_tree() takes (name,
> >                                 sha, mode)
> >
> >                                 This is all for hysterical raisins,
> >                                 unfortunately. I'm all for fixing
> >                                 it :-)
> >
> >                                 > We should really standardize this,
> >                                 preferably in one of three ways:
> >                                 > 1. Use (name, mode, sha) everywhere.
> >                                 > 2. Use a namedtuple of (name, mode,
> >                                 sha).
> >                                 > 2. Use a new data object.
> >                                 > I would prefer a namedtuple; it
> >                                 would make some of the code I'm
> >                                 > about to write cleaner. It would
> >                                 also be backwards-compatible with the
> >                                 > appropriate type. The downside is
> >                                 that it's another feature that's new
> >                                 > in 2.6.
> >
> >                                 It doesn't really help that you have
> >                                 two options 2! :-)
> >
> >                                 This is a really performance-sensitive
> >                                 area (at least for me), so if we
> >                                 go for anything else than a tuple we
> >                                 should make sure it doesn't have a
> >                                 (significant) impact on performance.
> >
> >                                 I don't think we should drop support
> >                                 for pre-2.6 Python yet; we could of
> >                                 course provide a custom implementation
> >                                 for those who run 2.4 or 2.5, but
> >                                 that seems like more trouble than (3).
> >
> >
> >                         Augie has told me from experience with hg that
> >                         data objects even with slots have worse
> >                         performance memory-wise than namedtuples. This
> >                         is presumably because there's a per-object
> >                         overhead for the slots, whereas namedtuples
> >                         have properties that live in the class dict.
> >                         I'm happy to run some benchmarks.
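[Editor's sketch: one quick way to sanity-check the slots-vs-namedtuple memory claim is sys.getsizeof on the shallow instances. Both classes here are hypothetical examples, and exact byte counts vary by Python version and build, so no winner is asserted.]

```python
import sys
from collections import namedtuple

NTEntry = namedtuple("NTEntry", ["name", "mode", "sha"])


class SlotEntry:
    """Hypothetical __slots__-based data object, for comparison only."""

    __slots__ = ("name", "mode", "sha")

    def __init__(self, name, mode, sha):
        self.name = name
        self.mode = mode
        self.sha = sha


nt = NTEntry("foo", 0o100644, "a" * 20)
se = SlotEntry("foo", 0o100644, "a" * 20)

# Shallow per-instance footprints; neither number includes the strings
# the fields point at.
print("namedtuple:", sys.getsizeof(nt), "slots:", sys.getsizeof(se))
```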
> >
> >
> >                         Another thing about the namedtuple function is
> >                         that you can make it print out the class
> >                         definition it creates, which provides an easy
> >                         way to backport specific namedtuple types to
> >                         previous versions.
> >
> >                                 3 seems like the best solution if it
> >                                 doesn't make things too much
> >                                 slower. Of course, we could give it
> >                                 semantics similar to what namedtuple
> >                                 would give us.
> >
> >
> >                         I don't mind much either way. It sounds like
> >                         benchmark results will be the deciding factor.
> >
> >                                 > In my ideal world we would get rid
> >                                 of Tree.entries() and change
> >                                 > index.commit_tree() to use the
> >                                 standard format, but I don't have a
> >                                 > sense of how widely either of those
> >                                 are used. Thoughts on how to
> >                                 > proceed?
> >
> >                                 Changing commit_tree() is probably
> >                                 possible as it's not very heavily
> >                                 used. Removing Tree.entries() is not
> >                                 possible I think, at least not
> >                                 without deprecating it first.
> >
> >
> >                         +1 to officially deprecating it. I thought it
> >                         was kind of unofficially deprecated already.
> >                         (There are two functions that are almost
> >                         identical, one is implemented in terms of the
> >                         other, and it's only for "historical
> >                         reasons"--that's pretty much what a deprecated
> >                         function looks like :)
> >
> >                                 Cheers,
> >
> >                                 Jelmer
> >
