
syncany-team team mailing list archive

Re: reusing core code

 

Thanks a lot for the response.  I would definitely like a peek at your
thesis (congratulations btw!).

I think my question pertains to your point (a) but I'm not sure I see
what in your list of results addresses my issue.  Basically I am looking
for an algorithm (or better yet, an implementation) that would take as
input a given file and also a set of chunks corresponding to an old
version of the same file.  As output it would produce an efficient set
of new chunks to encode the difference between the two files.  The
"rechunking" of the file should preserve, as much as possible, the old
chunks since these have already been sent over the network and stored
elsewhere.  I guess this is what you mean by "Effective chunking via
deduplication" but I don't quite see how that's being addressed in your
list of results.
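To make that concrete, here is a rough sketch of the kind of chunker I have
in mind: a content-defined chunker that picks breakpoints with a rolling
hash over a small window, so an insertion in the middle of a file should
only disturb the chunks around the edit, and the chunker resynchronizes on
the old boundaries afterwards (the max-chunk cap aside).  The window size,
the mask and the toy additive rolling hash are just placeholders for
illustration -- this is not TTTD/Adler32 and not Syncany's actual code.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class RollingChunker {

    private static final int WINDOW = 48;            // rolling-hash window size
    private static final int MASK = 0xFFF;           // ~4 KB average chunk size
    private static final int MAX_CHUNK = 64 * 1024;  // hard upper bound per chunk

    public static List<byte[]> chunk(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<byte[]>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        byte[] window = new byte[WINDOW];
        int pos = 0;
        int filled = 0;
        long hash = 0;  // toy additive rolling hash over the last WINDOW bytes

        int b;
        while ((b = in.read()) != -1) {
            current.write(b);

            // Slide the window: drop the oldest byte, add the new one.
            if (filled == WINDOW) {
                hash -= (window[pos] & 0xFF);
            } else {
                filled++;
            }
            window[pos] = (byte) b;
            pos = (pos + 1) % WINDOW;
            hash += (b & 0xFF);

            // Breakpoint whenever the rolling hash hits the magic pattern,
            // or when the chunk would otherwise get too large.
            boolean breakpoint = (filled == WINDOW) && ((hash & MASK) == MASK);
            if (breakpoint || current.size() >= MAX_CHUNK) {
                chunks.add(current.toByteArray());
                current.reset();
            }
        }
        if (current.size() > 0) {
            chunks.add(current.toByteArray());  // trailing partial chunk
        }
        return chunks;
    }
}

Because the breakpoints depend only on the local byte content, the chunks
after the insertion point should come out byte-identical to the old ones in
the common case, which is what makes re-use possible in the first place.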

Actually, I should clarify the problem as posed above.  Because the old
chunks are stored over the network we don't necessarily have access to
them, but rather just their hashes and (optionally) some small amount of
data like a few bytes from the beginning/end of each chunk (all of this
could be stored in some metadata that would be fast to download).  This is
an important point, because if we had the old chunks in their entirety we
could of course just regenerate the old file, run a diff, and chunk the
diff.  I believe this is similar to the problem Syncany has to deal with,
no?
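To illustrate why hashes alone might be enough, here is a sketch of the
lookup step I am imagining (again just an illustration with made-up names,
not Syncany's implementation): rechunk the new file locally, hash each
chunk, and upload only the chunks whose hashes are not already present in
the downloaded metadata of the old version.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ChunkDedup {

    /**
     * Returns only the chunks whose hashes are not already known from the
     * old version's metadata -- everything else can be referenced by hash.
     */
    public static List<byte[]> chunksToUpload(List<byte[]> newChunks,
                                              Set<String> oldChunkHashes)
            throws NoSuchAlgorithmException {
        List<byte[]> toUpload = new ArrayList<byte[]>();
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");

        for (byte[] chunk : newChunks) {
            String hash = toHex(sha1.digest(chunk));  // digest() also resets
            if (!oldChunkHashes.contains(hash)) {
                toUpload.add(chunk);   // genuinely new data, must be uploaded
            }
            // else: reuse the chunk already stored remotely under this hash
        }
        return toUpload;
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}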

I have an algorithm and a partial implementation that I would be happy to
elaborate on, but I'm guessing it's not going to compete well with your 140
tested algorithm configurations and a whole dissertation's worth of
research, so I'm eager to hear how you solved this problem :-)

Many thanks!

On 01/03/2012 08:25 PM, Philipp Heckel wrote:
> Hi there,
>
> happy new year to you all!!
>
> I certainly should be able to offer a short explanation -- since this
> is the exact topic of my thesis :-D I'll post a link as soon as it's
> ready. End of January at the latest.
>
> Here's a short explanation: there are two strategies to minimize (a)
> disk storage and (b) upload duration (= maximize upload bandwidth):
>
> a) Effective chunking via deduplication. In the thesis, I test over
> 140 different algorithm configurations to pick an algorithm that
> creates the smallest total size (sum of all chunks) -- and at the
> same time is CPU-friendly (minimal CPU usage) and doesn't take
> forever (minimal chunking duration).
>
> b) Combine smaller chunks into metachunks (= multiple chunks in a bigger
> chunk; could be just a ZIP file) to minimize upload latency and
> per-request overhead (for different storage types, e.g. FTP, S3, IMAP,
> ...).
>
> Here's the gist of the results -- the best configuration is:
> - the variable-size chunking method TTTD (as opposed to a fixed-size
> chunker)
> - using the fingerprinting algorithm Adler32 (as opposed to Rabin or
> PLAIN)
> - with a chunk size of 4 KB (as opposed to 8 KB or 16 KB)
> - combined to a 250 KB or 125 KB metachunk
> - and then compressed using Gzip (as opposed to Bzip2 or no compression)
> - using a dynamic sleep time in between the metachunks to keep CPU
> usage low (as opposed to a fixed sleep time of 30 ms or no sleep time)
>
>
> Making an API out of the chunking stuff is certainly a good idea. And
> as a matter of fact, I already sort of did that. I'll publish
> everything eventually. As soon as the thesis is done, I'll have some
> time :-D
>
> Let me know if you'd like a peek at the thesis. I'll be glad to send
> you the current version.
> I hope I could enlighten you a bit :-D
>
> Cheers,
> Philipp
>
> On Tue, Jan 3, 2012 at 6:09 PM, funky dev <funkydevnull@xxxxxxxxx> wrote:
>
>     I have a request I'd like to make as I see that the core
>     functionality is now being separated from the GUI.  I think it
>     would be useful to have some aspects, such as the file
>     chunking/encryption algorithm, separated out into a reusable Java
>     API.  I don't believe there is an API available with this
>     functionality, and I believe it is rather non-trivial to chunk a
>     file in such a way as to maximize re-use of previously used chunks
>     (assuming this functionality is in Syncany).  In fact I would be
>     interested to learn how the current code efficiently rechunks a
>     file to minimize the amount of data that needs to be sent over the
>     network.  E.g. if some data is inserted in the middle of the file,
>     we obviously don't want to re-upload all chunks after this point,
>     but this seems non-trivial to implement.  Can anyone offer a short
>     explanation?
>
>     thanks!
>
>
> -- 
> H3 11, 68159 Mannheim, Germany
> Landline: +49-621-98181881 / Cell phone: +49-173-7865012 (o2 network)
> Public Key: 0xBD5E122F at pgp.mit.edu


