← Back to team overview

dulwich-users team mailing list archive

Re: [PATCH 0/33] Rewrite and speed up pack inflation

 

I'll mail another copy of these with the right email adresses and NEWS
entries. Hopefully tonight (depending on OSCON :)

On Wed, Jul 27, 2011 at 14:29, Jelmer Vernooij <jelmer@xxxxxxxxx> wrote:

> Hi Dave,
>
> On Tue, Jul 26, 2011 at 07:02:58PM -0700, dborowitz@xxxxxxxxxx wrote:
> > The first of several patchbombs of code in use by the servers at
> code.google.com :)
>
> > This one accomplishes two big things:
> > 1. A big rewrite of the pack inflation and indexing code, which organizes
> reads
> >    around chains of deltas. This guarantees that each object is read and
> inflated
> >    exactly once by organizing reads around delta chains. Overall,
> improves pack
> >    indexing performance of large packs like linux-2.6.git by at least 4x,
> and
> >    is "only" 2-3x slower than the optimized C git implementation
> > 2. Adding an UnpackedGitObject that encapsulates some of the data
> formerly
> >    passed around as tuples from the various functions in pack.py. Rather
> than
> >    constant multiple return value packing and unpacking, just pass around
> single
> >    objects and mutate their state.
>
> > Various additional cleanups on top of these.
>
> Those are some really nice improvements, thanks. :-)
>
> I don't really have any questions or requests for the overall changes.
>
> The main thing that would be nice to do is make sure all changes
> to reasonably public APIs are documented in the NEWS file. I can add
> these as I apply the patches.
> Perhaps we should also document our de-facto policy on API changes
> somewhere. So far it has been something along the lines of:
>
>  - Everything that has a name starting with an underscore is
>   considered private and fair game for backwards incompatible changes
>  - For other code we try to document the API changes in NEWS

 - For commonly used APIs we deprecate first (and add deprecation
>   warnings), then remove the functionality a couple of releases
>   later. What is considered commonly used is a bit vague and up to
>   our judgement.
>

Agreed, and you're certainly in a better place to give an estimate than I
am. I'm happy to deprecate and add "new" versions of some of the functions
in pack.py I was messing with, if you have any in mind.


> Do we care about breakages in the testsuite utility functions, etc?
>

Good question. I think public functions in test utils should fall in the
same category as any other public functions.


> Do you agree that's what we've been doing, and does it seem reasonable
> as a policy?
>

I agree that it's what you've been doing (though I've been remiss), and it
certainly seems reasonable as a general policy. I'd love to see it in
HACKING so I can refer to it in the future.

As part of this, we may want to scrub through a few modules and decide to
move some functions to private.


> Are you both dborowitz@xxxxxxxxxx and ddborowitz@xxxxxxxxxx, or is the
> second a typo? Your emails seem to use the first but your patches use
> the second.


dborowitz@xxxxxxxxxx and ddborowitz@xxxxxxxxx are both me. If you ever see
ddborowitz@xxxxxxxxxx (and I didn't see any in this series), that is
nonexistent, but I don't think I have that set in any of my .gitconfigs.


> > 4c6e275 pack: Standardize on single quotes.
> ok
>
> > 8ec2987 pack: Clean up unpack_object.
> ok
>
> > 9b3fc71 pack: Compute CRC32 during object unpacking.
> ok.  This changes the behaviour of take_msb_bytes, read_zlib_chunks,
> unpack_object and PackStreamReader.read_objects as they now return a tuple
> or one
> more element in the tuple they already returned.
>
> I'm also a bit surprised this didn't require more tests to be updated.
> Your patch doesn't make the situation worse, it'd be nice to add more
> tests at some point.
>

Yes, more tests are good, sorry about that. Do you mind if those come in
some followup changes to cut down on conflicts within the series?


> > de42ceb pack: Inline PackObjectIterator.
> ok
>
> > 764ec04 object_store: Fix return type of MemoryObjectStore.get_raw.
> ok
>
> > 3c4ec6a pack: Add a DeltaChainIterator for faster iteration of PackData.
> ok
>
> > 5cd5f24 Make the server thread raise errors in compat tests.
> ok
>
> > 8dceaac server: Fix short-circuit behavior for no-op fetches.
> ok
>
> > 5651324 pack: Add a PackIndexer to index packs more quickly.
> ok
>
> > 5690a9b pack: PackStreamReader SHA calculation and docstring cleanup.
> ok
>
> > 7caefca pack: Expose which refs were external in DeltaChainIterator.
> It seems a bit odd that the tests look at the private variable for the
> ext refs here rather than using the public accessor method. Is there a
> particular reason for that?


Almost certainly not, but I'll be honest and say I don't remember.


>  > 280f4b0 pack: Allow write_pack_object to compute a SHA.
> ok
>
> > f2000f2 server: Make PackStreamCopier optionally record delta chains.
> ok.
>
> (Looks like PackStreamCopier and PackStreamReader could also use some
> more tests)


Agreed.


> > 3ded386 tests: Move write_pack_data to utils.build_pack.
> ok
>
> > b18c613 misc: Add SEEK_CUR.
> ok
>
> > 2e0ffd3 pack: Include offset in PackStreamReader results.
> This would also be worth a NEWS entry.
>
> > d8eb15a tests/utils: Pass a file object into build_pack.
> ok
>
> > 0d9deaa Move PackStreamReader from server to pack.
> This seems to move PackStreamCopier rather than PackStreamReader?


Oops.


>  > 1e8d184 pack: use SEEK_END for PackData.get_stored_checksum().
> ok
>
> > b032b1e test_pack: Test checksum and length mismatch conditions.
> testing ftw
>
> > 9286dd6 pack: Extract a method to check pack length and SHA.
> ok
>
> > fe363ec pack: Extract a function to compute the SHA of a file.
> ok
>
> > 17e1378 Rewrite add_thin_pack to use the fast PackIndexer.
> This requires some updates in NEWS as
> ObjectStore.add_thin_pack has a different API and
> ObjectStore.move_in_thin_pack has disappeared.
>
> > 9facb62 pack: Pass a zlib buffer size through to read_zlib_chunks.
> ok
>
> > 870e006 pack: Fix a buffering issue with PackStreamReader; add tests.
> ok
>
> > 40b145c pack: Add PackInflater to quickly inflate pack objects.
> ok
>
> > 3f7b7dc pack: Nuke ThinPackData.
> This will need to be documented in NEWS. IIRC we use it in bzr-git.


This is a fine candidate for deprecation, I don't mind leaving it in.


>  > a2a6078 _compat: Use namedtuple recipe rather than hard-coding.
> We should note the license under which the code fragment is available.
> That also appears to be missing for permutations().


Hm, I thought the PSF copyright was sufficient for PSL software, but I see
no harm in mentioning it. permutations() is also PSL since it's straight
from the itertools docs.


> > 911076c _compat: Inline specific namedtuple instances.
> ok
>
> > 23095f8 pack: Create an _UnpackedObject for better encapsulation.
> Nice!
>
> Perhaps this should be public, I think I would want to use it in bzr-git.
>

That sounds reasonable. I was initially thinking that callers shouldn't be
creating these themselves (though they are fair game as return values), but
on further experience it seems like a reasonable thing to make public.


> This also isn't backwards compatible, so would need update to NEWS.


I'll leave it up to your call on this, but IMO it would be pretty ugly to go
the deprecation route here; we would end up with two basically-identical,
moderately complex code paths. I also think UnpackedObjects are enough
cleaner that people would hopefully be motivated to switch :)


> > 00e2b20 pack: Add option to include compressed data in _UnpackedObjects.
> ok
>
> > 91aba9d pack: Remove comp_len from _UnpackedObject.
> ok
>
> > d0174c1 pack: Extract a function to write a packed object header.
> ok
>
> Cheers,
>
> Jelmer
>

Follow ups

References