dulwich-users team mailing list archive
-
dulwich-users team
-
Mailing list archive
-
Message #00234
Re: Tree entries
I have some benchmark results for namedtuples vs. tuples. First, the
microbenchmark results:
$ python -m timeit -s 'from dulwich.objects import TreeEntry; name =
"foo/bar"; mode = 0100644; sha = "a" * 20' 'x = TreeEntry(name, mode, sha)'
1000000 loops, best of 3: 0.583 usec per loop
$ python -m timeit -s 'name = "foo/bar"; mode = 0100644; sha = "a" * 20' 'x
= (name, mode, sha)'
10000000 loops, best of 3: 0.0753 usec per loop
Obviously the tuple constructor should win over TreeEntry constructor, since
the latter is a wrapper around the former, and there's significant Python
function call overhead. But hey, 0.5us is still pretty fast.
Then I ran a much bigger macrobenchmark (attached). Basically, I cloned
git.git, ran git unpack-object to explode the repo into loose files, found
all the tree SHAs (without parsing the objects), then measured the time to
parse all those trees and iterate all their entries. In the inner loop I
also assigned all of the tuple/namedtuple values to locals to check for
overhead there.
heapy was used to measure the heap. My workstation is a Core 2 Quad with 8GB
RAM running Ubuntu 10.04. (Note that the whole heap fit easily in main
memory.)
=== namedtuple ===
$ python tree_entry_macro.py
Read 128900 objects
Collected 43000 trees
Iterated 42900 trees
Collected 9237351 entries in 180.18s. Heap results follow.
Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 18452772 50 1273945424 55 1273945424 55 str
1 9237351 25 738988080 32 2012933504 87 tuple
2 9237352 25 221696448 10 2234629952 96 int
3 1 0 83886152 4 2318516104 100 list
4 1 0 448 0 2318516552 100 types.FrameType
5 2 0 48 0 2318516600 100 float
=== tuple ===
$ python tree_entry_macro.py
Read 128900 objects
Collected 43000 trees
Iterated 42900 trees
Collected 9237351 entries in 179.49s. Heap results follow.
Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 18452772 50 1273945424 55 1273945424 55 str
1 9237351 25 738988080 32 2012933504 87 tuple
2 9237352 25 221696448 10 2234629952 96 int
3 1 0 83886152 4 2318516104 100 list
4 1 0 448 0 2318516552 100 types.FrameType
5 2 0 48 0 2318516600 100 float
As expected, memory usage is identical, since namedtuples are just tuples
with no additional per-instance memory overhead. Based on the microbenchmark
results alone, the expected overhead for creating 9.2M TreeEntry objects
alone is 9237351 * (0.583 - 0.0753) = 4.7s. So I'm guessing the time
difference is not signifcant.
On Thu, Oct 7, 2010 at 08:46, David Borowitz <dborowitz@xxxxxxxxxx> wrote:
>
>
> On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:
>
>> On Wed, 2010-10-06 at 14:19 -0700, David Borowitz wrote:
>> > In the stuff I'm doing on rename detection, I've been bitten a few
>> > times by the weird ordering of (name, mode, sha) tuples coming out of
>> > various tree iteration methods.
>>
>> > Currently, we have:
>> > Tree.entries() yields (mode, name, sha)
>> > Tree.iteritems() and BaseObjectStore.iter_tree_contents() yield (name,
>> > mode, sha)
>> > index.commit_tree() takes (name, sha, mode)
>> This is all for hysterical raisins, unfortunately. I'm all for fixing
>> it :-)
>>
>> > We should really standardize this, preferably in one of three ways:
>> > 1. Use (name, mode, sha) everywhere.
>> > 2. Use a namedtuple of (name, mode, sha).
>> > 2. Use a new data object.
>> > I would prefer using a namedtuple would make some of the code I'm
>> > about to write cleaner. It would also be backwards-compatible with the
>> > appropriate type. The downside is that it's another feature that's new
>> > in 2.6.
>> It doesn't really help that you have two options 2 ! :-)
>>
>> This is a really performance-sensitive area (at least for me), so if we
>> go for anything else than a tuple we should make sure it doesn't have a
>> (significant) impact on performance.
>>
>> I don't think we should drop support for pre-2.6 Python yet; we could of
>> course provide a custom implementation for those who run 2.4 or 2.5, but
>> that seems like more trouble than (3).
>>
>
> Augie has told me from experience with hg that data objects even with slots
> have worse performance memory-wise than namedtuples. This is presumably
> because there's a per-object overhead for the slots, whereas namedtuples
> have properties that live in the class dict. I'm happy to run some
> benchmarks
>
> Another thing about the namedtuple function is that you can make it print
> out the class definition it creates, which provides an easy way to backport
> specific namedtuple types to previous versions.
>
>
>> 3 seems like the best solution if it doesn't make things too much
>> slower. Of course, we could give it semantics similar to what namedtuple
>> would give us.
>
>
> I don't mind much either way. It sounds like benchmark results will be the
> deciding factor.
>
>
>> > In my ideal world we would get rid of Tree.entries() and change
>> > index.commit_tree() to use the standard format, but I don't have a
>> > sense of how widely either of those are used. Thoughts on how to
>> > proceed?
>> Changing commit_tree() is probably possible as it's not very heavily
>> used. Removing Tree.entries() is not possible I think, at least not
>> without deprecating it first.
>>
>
> +1 to officially deprecating it. I thought it was kind of unofficially
> deprecated already. (There's two functions that are almost identical, one is
> implemented in terms of the other, and it's only for "historical
> reasons"--that's pretty much what a deprecated function looks like :)
>
>
>> Cheers,
>>
>> Jelmer
>>
>
>
#!/usr/bin/python
#
# Copyright 2010 Google Inc. All Rights Reserved.
PATH = '/home/dborowitz/d/git.loose'
import os
import sys
import time
from dulwich import objects
from dulwich import repo
import guppy
def main():
hp = guppy.hpy()
r = repo.Repo('/home/dborowitz/d/git.loose')
tree_shas = []
for i, sha in enumerate(r.object_store):
if not i % 1000:
sys.stdout.write('\rRead %i objects' % i)
sys.stdout.flush()
with open(os.path.join(r.object_store.path, sha[:2], sha[2:])) as f:
# Parse the file header but not the contents.
obj = objects.ShaFile._parse_legacy_object_header('', f)
if isinstance(obj, objects.Tree):
tree_shas.append(sha)
sys.stdout.write('\nCollected %i trees\n' % len(tree_shas))
hp.setrelheap()
entries = []
start_time = time.time()
for i, tree_sha in enumerate(tree_shas):
if not i % 100:
sys.stdout.write('\rParsed %i trees' % i)
sys.stdout.flush()
for entry in r.object_store[tree_sha].iteritems():
entries.append(entry)
if type(entry) == tuple:
name, mode, sha = entry
else:
name = entry.name
mode = entry.mode
sha = entry.sha
end_time = time.time()
del tree_shas
sys.stdout.write('\nCollected %i entries in %.2fs. Heap results follow.\n' %
(len(entries), end_time - start_time))
print hp.heap()
if __name__ == '__main__':
main()
Follow ups
References