← Back to team overview

dulwich-users team mailing list archive

Re: [PATCH 0/33] Rewrite and speed up pack inflation

 

Hi Dave,

On Tue, Jul 26, 2011 at 07:02:58PM -0700, dborowitz@xxxxxxxxxx wrote:
> The first of several patchbombs of code in use by the servers at code.google.com :)

> This one accomplishes two big things:
> 1. A big rewrite of the pack inflation and indexing code, which organizes reads
>    around chains of deltas. This guarantees that each object is read and inflated
>    exactly once by organizing reads around delta chains. Overall, improves pack
>    indexing performance of large packs like linux-2.6.git by at least 4x, and
>    is "only" 2-3x slower than the optimized C git implementation
> 2. Adding an UnpackedGitObject that encapsulates some of the data formerly
>    passed around as tuples from the various functions in pack.py. Rather than
>    constant multiple return value packing and unpacking, just pass around single
>    objects and mutate their state.

> Various additional cleanups on top of these.

Those are some really nice improvements, thanks. :-)

I don't really have any questions or requests for the overall changes.

The main thing that would be nice to do is make sure all changes
to reasonably public APIs are documented in the NEWS file. I can add
these as I apply the patches.
Perhaps we should also document our de-facto policy on API changes
somewhere. So far it has been something along the lines of:

 - Everything that has a name starting with an underscore is
   considered private and fair game for backwards incompatible changes
 - For other code we try to document the API changes in NEWS
 - For commonly used APIs we deprecate first (and add deprecation
   warnings), then remove the functionality a couple of releases
   later. What is considered commonly used is a bit vague and up to
   our judgement.

Do we care about breakages in the testsuite utility functions, etc?

Do you agree that's what we've been doing, and does it seem reasonable
as a policy?

Are you both dborowitz@xxxxxxxxxx and ddborowitz@xxxxxxxxxx, or is the
second a typo? Your emails seem to use the first but your patches use
the second.

> 4c6e275 pack: Standardize on single quotes.
ok

> 8ec2987 pack: Clean up unpack_object.
ok

> 9b3fc71 pack: Compute CRC32 during object unpacking.
ok.  This changes the behaviour of take_msb_bytes, read_zlib_chunks,
unpack_object and PackStreamReader.read_objects as they now return a tuple or one
more element in the tuple they already returned.

I'm also a bit surprised this didn't require more tests to be updated.
Your patch doesn't make the situation worse, it'd be nice to add more
tests at some point.

> de42ceb pack: Inline PackObjectIterator.
ok

> 764ec04 object_store: Fix return type of MemoryObjectStore.get_raw.
ok

> 3c4ec6a pack: Add a DeltaChainIterator for faster iteration of PackData.
ok

> 5cd5f24 Make the server thread raise errors in compat tests.
ok

> 8dceaac server: Fix short-circuit behavior for no-op fetches.
ok

> 5651324 pack: Add a PackIndexer to index packs more quickly.
ok

> 5690a9b pack: PackStreamReader SHA calculation and docstring cleanup.
ok

> 7caefca pack: Expose which refs were external in DeltaChainIterator.
It seems a bit odd that the tests look at the private variable for the
ext refs here rather than using the public accessor method. Is there a
particular reason for that?

> 280f4b0 pack: Allow write_pack_object to compute a SHA.
ok

> f2000f2 server: Make PackStreamCopier optionally record delta chains.
ok.

(Looks like PackStreamCopier and PackStreamReader could also use some
more tests)

> 3ded386 tests: Move write_pack_data to utils.build_pack.
ok

> b18c613 misc: Add SEEK_CUR.
ok

> 2e0ffd3 pack: Include offset in PackStreamReader results.
This would also be worth a NEWS entry.

> d8eb15a tests/utils: Pass a file object into build_pack.
ok

> 0d9deaa Move PackStreamReader from server to pack.
This seems to move PackStreamCopier rather than PackStreamReader?

> 1e8d184 pack: use SEEK_END for PackData.get_stored_checksum().
ok

> b032b1e test_pack: Test checksum and length mismatch conditions.
testing ftw

> 9286dd6 pack: Extract a method to check pack length and SHA.
ok

> fe363ec pack: Extract a function to compute the SHA of a file.
ok

> 17e1378 Rewrite add_thin_pack to use the fast PackIndexer.
This requires some updates in NEWS as
ObjectStore.add_thin_pack has a different API and
ObjectStore.move_in_thin_pack has disappeared.

> 9facb62 pack: Pass a zlib buffer size through to read_zlib_chunks.
ok

> 870e006 pack: Fix a buffering issue with PackStreamReader; add tests.
ok

> 40b145c pack: Add PackInflater to quickly inflate pack objects.
ok

> 3f7b7dc pack: Nuke ThinPackData.
This will need to be documented in NEWS. IIRC we use it in bzr-git.

> a2a6078 _compat: Use namedtuple recipe rather than hard-coding.
We should note the license under which the code fragment is available.
That also appears to be missing for permutations().

> 911076c _compat: Inline specific namedtuple instances.
ok

> 23095f8 pack: Create an _UnpackedObject for better encapsulation.
Nice!

Perhaps this should be public, I think I would want to use it in bzr-git.

This also isn't backwards compatible, so would need update to NEWS.

> 00e2b20 pack: Add option to include compressed data in _UnpackedObjects.
ok

> 91aba9d pack: Remove comp_len from _UnpackedObject.
ok

> d0174c1 pack: Extract a function to write a packed object header.
ok

Cheers,

Jelmer

Attachment: signature.asc
Description: Digital signature


Follow ups

References