dulwich-users team mailing list archive
-
dulwich-users team
-
Mailing list archive
-
Message #00243
Re: Tree entries
On Sat, Oct 16, 2010 at 14:59, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:
> On Fri, 2010-10-08 at 12:51 -0700, David Borowitz wrote:
> > Another place with a completely different order: Tree.add(mode, name,
> > sha). I don't suppose that's one we can safely change, is it?
> I guess we could check the types of mode and name and reverse them (and
> print a DeprecationWarning) if they're wrong.
>
> It's not entirely nice but I think the cleanest way of fixing it.
>
Yeah. I don't have a sense of whether this is a thing other Python projects
do...
I'm happy to make this change but it's not on my critical path at the
moment.
> Cheers,
>
> Jelmer
>
> > On Thu, Oct 7, 2010 at 15:07, David Borowitz <dborowitz@xxxxxxxxxx>
> > wrote:
> > Oh, also, another reason I'm leaning towards namedtuples at
> > this point is that they're fully compatible with the existing
> > API. With a custom class, assuming it's faster (which I
> > doubt), we'd either have to:
> > -Deprecate Tree.iteritems(), making there be *three* methods
> > that do the same thing. (Plus all the code changes to use the
> > new API.)
> > -Implement __iter__, basically turning the custom class into a
> > namedtuple anyway.
> >
> >
> >
> > On Thu, Oct 7, 2010 at 14:59, David Borowitz
> > <dborowitz@xxxxxxxxxx> wrote:
> > I have some benchmark results for namedtuples vs.
> > tuples. First, the microbenchmark results:
> > $ python -m timeit -s 'from dulwich.objects import
> > TreeEntry; name = "foo/bar"; mode = 0100644; sha = "a"
> > * 20' 'x = TreeEntry(name, mode, sha)'
> > 1000000 loops, best of 3: 0.583 usec per loop
> > $ python -m timeit -s 'name = "foo/bar"; mode =
> > 0100644; sha = "a" * 20' 'x = (name, mode, sha)'
> > 10000000 loops, best of 3: 0.0753 usec per loop
> >
> >
> > Obviously the tuple constructor should win over
> > TreeEntry constructor, since the latter is a wrapper
> > around the former, and there's significant Python
> > function call overhead. But hey, 0.5us is still pretty
> > fast.
> >
> >
> > Then I ran a much bigger macrobenchmark (attached).
> > Basically, I cloned git.git, ran git unpack-object to
> > explode the repo into loose files, found all the tree
> > SHAs (without parsing the objects), then measured the
> > time to parse all those trees and iterate all their
> > entries. In the inner loop I also assigned all of the
> > tuple/namedtuple values to locals to check for
> > overhead there.
> >
> >
> > heapy was used to measure the heap. My workstation is
> > a Core 2 Quad with 8GB RAM running Ubuntu 10.04. (Note
> > that the whole heap fit easily in main memory.)
> >
> >
> > === namedtuple ===
> > $ python tree_entry_macro.py
> > Read 128900 objects
> > Collected 43000 trees
> > Iterated 42900 trees
> > Collected 9237351 entries in 180.18s. Heap results
> > follow.
> > Partition of a set of 36927479 objects. Total size =
> > 2318516600 bytes.
> > Index Count % Size % Cumulative % Kind
> > (class / dict of class)
> > 0 18452772 50 1273945424 55 1273945424 55 str
> > 1 9237351 25 738988080 32 2012933504 87 tuple
> > 2 9237352 25 221696448 10 2234629952 96 int
> > 3 1 0 83886152 4 2318516104 100 list
> > 4 1 0 448 0 2318516552 100
> > types.FrameType
> > 5 2 0 48 0 2318516600 100 float
> >
> >
> > === tuple ===
> > $ python tree_entry_macro.py
> > Read 128900 objects
> > Collected 43000 trees
> > Iterated 42900 trees
> > Collected 9237351 entries in 179.49s. Heap results
> > follow.
> > Partition of a set of 36927479 objects. Total size =
> > 2318516600 bytes.
> > Index Count % Size % Cumulative % Kind
> > (class / dict of class)
> > 0 18452772 50 1273945424 55 1273945424 55 str
> > 1 9237351 25 738988080 32 2012933504 87 tuple
> > 2 9237352 25 221696448 10 2234629952 96 int
> > 3 1 0 83886152 4 2318516104 100 list
> > 4 1 0 448 0 2318516552 100
> > types.FrameType
> > 5 2 0 48 0 2318516600 100 float
> >
> >
> > As expected, memory usage is identical, since
> > namedtuples are just tuples with no additional
> > per-instance memory overhead. Based on the
> > microbenchmark results alone, the expected overhead
> > for creating 9.2M TreeEntry objects alone is 9237351 *
> > (0.583 - 0.0753) = 4.7s. So I'm guessing the time
> > difference is not signifcant.
> >
> >
> > On Thu, Oct 7, 2010 at 08:46, David Borowitz
> > <dborowitz@xxxxxxxxxx> wrote:
> >
> >
> >
> > On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij
> > <jelmer@xxxxxxxxx> wrote:
> > On Wed, 2010-10-06 at 14:19 -0700,
> > David Borowitz wrote:
> > > In the stuff I'm doing on rename
> > detection, I've been bitten a few
> > > times by the weird ordering of
> > (name, mode, sha) tuples coming out of
> > > various tree iteration methods.
> >
> > > Currently, we have:
> > > Tree.entries() yields (mode, name,
> > sha)
> > > Tree.iteritems() and
> > BaseObjectStore.iter_tree_contents()
> > yield (name,
> > > mode, sha)
> > > index.commit_tree() takes (name,
> > sha, mode)
> >
> > This is all for hysterical raisins,
> > unfortunately. I'm all for fixing
> > it :-)
> >
> > > We should really standardize this,
> > preferably in one of three ways:
> > > 1. Use (name, mode, sha) everywhere.
> > > 2. Use a namedtuple of (name, mode,
> > sha).
> > > 2. Use a new data object.
> > > I would prefer using a namedtuple
> > would make some of the code I'm
> > > about to write cleaner. It would
> > also be backwards-compatible with the
> > > appropriate type. The downside is
> > that it's another feature that's new
> > > in 2.6.
> >
> > It doesn't really help that you have
> > two options 2 ! :-)
> >
> > This is a really performance-sensitive
> > area (at least for me), so if we
> > go for anything else than a tuple we
> > should make sure it doesn't have a
> > (significant) impact on performance.
> >
> > I don't think we should drop support
> > for pre-2.6 Python yet; we could of
> > course provide a custom implementation
> > for those who run 2.4 or 2.5, but
> > that seems like more trouble than (3).
> >
> >
> > Augie has told me from experience with hg that
> > data objects even with slots have worse
> > performance memory-wise than namedtuples. This
> > is presumably because there's a per-object
> > overhead for the slots, whereas namedtuples
> > have properties that live in the class dict.
> > I'm happy to run some benchmarks
> >
> >
> > Another thing about the namedtuple function is
> > that you can make it print out the class
> > definition it creates, which provides an easy
> > way to backport specific namedtuple types to
> > previous versions.
> >
> > 3 seems like the best solution if it
> > doesn't make things too much
> > slower. Of course, we could give it
> > semantics similar to what namedtuple
> > would give us.
> >
> >
> > I don't mind much either way. It sounds like
> > benchmark results will be the deciding factor.
> >
> > > In my ideal world we would get rid
> > of Tree.entries() and change
> > > index.commit_tree() to use the
> > standard format, but I don't have a
> > > sense of how widely either of those
> > are used. Thoughts on how to
> > > proceed?
> >
> > Changing commit_tree() is probably
> > possible as it's not very heavily
> > used. Removing Tree.entries() is not
> > possible I think, at least not
> > without deprecating it first.
> >
> >
> > +1 to officially deprecating it. I thought it
> > was kind of unofficially deprecated already.
> > (There's two functions that are almost
> > identical, one is implemented in terms of the
> > other, and it's only for "historical
> > reasons"--that's pretty much what a deprecated
> > function looks like :)
> >
> > Cheers,
> >
> > Jelmer
> >
> >
> >
> >
> >
> >
>
>
Follow ups
References