dulwich-users team mailing list archive
-
dulwich-users team
-
Mailing list archive
-
Message #00239
Re: Tree entries
On Fri, 2010-10-08 at 12:51 -0700, David Borowitz wrote:
> Another place with a completely different order: Tree.add(mode, name,
> sha). I don't suppose that's one we can safely change, is it?
I guess we could check the types of mode and name and reverse them (and
print a DeprecationWarning) if they're wrong.
It's not entirely nice but I think the cleanest way of fixing it.
Cheers,
Jelmer
> On Thu, Oct 7, 2010 at 15:07, David Borowitz <dborowitz@xxxxxxxxxx>
> wrote:
> Oh, also, another reason I'm leaning towards namedtuples at
> this point is that they're fully compatible with the existing
> API. With a custom class, assuming it's faster (which I
> doubt), we'd either have to:
> -Deprecate Tree.iteritems(), making there be *three* methods
> that do the same thing. (Plus all the code changes to use the
> new API.)
> -Implement __iter__, basically turning the custom class into a
> namedtuple anyway.
>
>
>
> On Thu, Oct 7, 2010 at 14:59, David Borowitz
> <dborowitz@xxxxxxxxxx> wrote:
> I have some benchmark results for namedtuples vs.
> tuples. First, the microbenchmark results:
> $ python -m timeit -s 'from dulwich.objects import
> TreeEntry; name = "foo/bar"; mode = 0100644; sha = "a"
> * 20' 'x = TreeEntry(name, mode, sha)'
> 1000000 loops, best of 3: 0.583 usec per loop
> $ python -m timeit -s 'name = "foo/bar"; mode =
> 0100644; sha = "a" * 20' 'x = (name, mode, sha)'
> 10000000 loops, best of 3: 0.0753 usec per loop
>
>
> Obviously the tuple constructor should win over
> TreeEntry constructor, since the latter is a wrapper
> around the former, and there's significant Python
> function call overhead. But hey, 0.5us is still pretty
> fast.
>
>
> Then I ran a much bigger macrobenchmark (attached).
> Basically, I cloned git.git, ran git unpack-object to
> explode the repo into loose files, found all the tree
> SHAs (without parsing the objects), then measured the
> time to parse all those trees and iterate all their
> entries. In the inner loop I also assigned all of the
> tuple/namedtuple values to locals to check for
> overhead there.
>
>
> heapy was used to measure the heap. My workstation is
> a Core 2 Quad with 8GB RAM running Ubuntu 10.04. (Note
> that the whole heap fit easily in main memory.)
>
>
> === namedtuple ===
> $ python tree_entry_macro.py
> Read 128900 objects
> Collected 43000 trees
> Iterated 42900 trees
> Collected 9237351 entries in 180.18s. Heap results
> follow.
> Partition of a set of 36927479 objects. Total size =
> 2318516600 bytes.
> Index Count % Size % Cumulative % Kind
> (class / dict of class)
> 0 18452772 50 1273945424 55 1273945424 55 str
> 1 9237351 25 738988080 32 2012933504 87 tuple
> 2 9237352 25 221696448 10 2234629952 96 int
> 3 1 0 83886152 4 2318516104 100 list
> 4 1 0 448 0 2318516552 100
> types.FrameType
> 5 2 0 48 0 2318516600 100 float
>
>
> === tuple ===
> $ python tree_entry_macro.py
> Read 128900 objects
> Collected 43000 trees
> Iterated 42900 trees
> Collected 9237351 entries in 179.49s. Heap results
> follow.
> Partition of a set of 36927479 objects. Total size =
> 2318516600 bytes.
> Index Count % Size % Cumulative % Kind
> (class / dict of class)
> 0 18452772 50 1273945424 55 1273945424 55 str
> 1 9237351 25 738988080 32 2012933504 87 tuple
> 2 9237352 25 221696448 10 2234629952 96 int
> 3 1 0 83886152 4 2318516104 100 list
> 4 1 0 448 0 2318516552 100
> types.FrameType
> 5 2 0 48 0 2318516600 100 float
>
>
> As expected, memory usage is identical, since
> namedtuples are just tuples with no additional
> per-instance memory overhead. Based on the
> microbenchmark results alone, the expected overhead
> for creating 9.2M TreeEntry objects alone is 9237351 *
> (0.583 - 0.0753) = 4.7s. So I'm guessing the time
> difference is not signifcant.
>
>
> On Thu, Oct 7, 2010 at 08:46, David Borowitz
> <dborowitz@xxxxxxxxxx> wrote:
>
>
>
> On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij
> <jelmer@xxxxxxxxx> wrote:
> On Wed, 2010-10-06 at 14:19 -0700,
> David Borowitz wrote:
> > In the stuff I'm doing on rename
> detection, I've been bitten a few
> > times by the weird ordering of
> (name, mode, sha) tuples coming out of
> > various tree iteration methods.
>
> > Currently, we have:
> > Tree.entries() yields (mode, name,
> sha)
> > Tree.iteritems() and
> BaseObjectStore.iter_tree_contents()
> yield (name,
> > mode, sha)
> > index.commit_tree() takes (name,
> sha, mode)
>
> This is all for hysterical raisins,
> unfortunately. I'm all for fixing
> it :-)
>
> > We should really standardize this,
> preferably in one of three ways:
> > 1. Use (name, mode, sha) everywhere.
> > 2. Use a namedtuple of (name, mode,
> sha).
> > 2. Use a new data object.
> > I would prefer using a namedtuple
> would make some of the code I'm
> > about to write cleaner. It would
> also be backwards-compatible with the
> > appropriate type. The downside is
> that it's another feature that's new
> > in 2.6.
>
> It doesn't really help that you have
> two options 2 ! :-)
>
> This is a really performance-sensitive
> area (at least for me), so if we
> go for anything else than a tuple we
> should make sure it doesn't have a
> (significant) impact on performance.
>
> I don't think we should drop support
> for pre-2.6 Python yet; we could of
> course provide a custom implementation
> for those who run 2.4 or 2.5, but
> that seems like more trouble than (3).
>
>
> Augie has told me from experience with hg that
> data objects even with slots have worse
> performance memory-wise than namedtuples. This
> is presumably because there's a per-object
> overhead for the slots, whereas namedtuples
> have properties that live in the class dict.
> I'm happy to run some benchmarks
>
>
> Another thing about the namedtuple function is
> that you can make it print out the class
> definition it creates, which provides an easy
> way to backport specific namedtuple types to
> previous versions.
>
> 3 seems like the best solution if it
> doesn't make things too much
> slower. Of course, we could give it
> semantics similar to what namedtuple
> would give us.
>
>
> I don't mind much either way. It sounds like
> benchmark results will be the deciding factor.
>
> > In my ideal world we would get rid
> of Tree.entries() and change
> > index.commit_tree() to use the
> standard format, but I don't have a
> > sense of how widely either of those
> are used. Thoughts on how to
> > proceed?
>
> Changing commit_tree() is probably
> possible as it's not very heavily
> used. Removing Tree.entries() is not
> possible I think, at least not
> without deprecating it first.
>
>
> +1 to officially deprecating it. I thought it
> was kind of unofficially deprecated already.
> (There's two functions that are almost
> identical, one is implemented in terms of the
> other, and it's only for "historical
> reasons"--that's pretty much what a deprecated
> function looks like :)
>
> Cheers,
>
> Jelmer
>
>
>
>
>
>
Attachment:
signature.asc
Description: This is a digitally signed message part
Follow ups
References