syncany-team mailing list archive
Message #00450
Re: reusing core code
Hi there,
happy new year to you all!!
I certainly should be able to offer a short explanation -- since this is
the exact topic of my thesis :-D I'll post a link as soon as it's ready.
End of January at the latest.
Here's a short explanation: there are two strategies to minimize (a) disk
storage and (b) upload duration (= maximize upload bandwidth):
a) Effective chunking via deduplication. In the thesis, I test over 140
different algorithm configurations to pick the one that creates the
smallest total size (sum of all chunks) -- and at the same time is
CPU friendly (minimal CPU usage) and doesn't take forever (minimal chunking
duration).
b) Combining smaller chunks into metachunks (= multiple chunks packed into one
bigger chunk; could be just a ZIP file) to minimize upload latency and
per-request overhead (for different storage types, e.g. FTP, S3, IMAP, ...).
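To make (a) and (b) a bit more concrete, here's a rough sketch in Java: every
chunk is identified by its checksum, duplicates are skipped, and new chunks are
collected into a metachunk that gets gzipped before upload. The class and
method names are made up for this mail -- it's not the actual Syncany code,
just the idea.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: deduplicate chunks by checksum and pack the new ones
// into a gzipped metachunk. Names are made up; this is not Syncany's API.
public class MetaChunkPacker {

    private static final int METACHUNK_TARGET_SIZE = 250 * 1024; // ~250 KB before compression

    private final Set<String> knownChecksums = new HashSet<>();   // chunks already stored
    private final ByteArrayOutputStream metaChunk = new ByteArrayOutputStream();

    /** Adds a chunk; returns true if it was new and had to be stored. */
    public boolean addChunk(byte[] chunk) throws IOException, NoSuchAlgorithmException {
        String checksum = toHex(MessageDigest.getInstance("SHA-1").digest(chunk));

        if (!knownChecksums.add(checksum)) {
            return false; // duplicate chunk: only a reference needs to be kept
        }

        metaChunk.write(chunk);

        if (metaChunk.size() >= METACHUNK_TARGET_SIZE) {
            flushMetaChunk();
        }

        return true;
    }

    /** Compresses the pending metachunk with Gzip; the upload would happen here. */
    public void flushMetaChunk() throws IOException {
        if (metaChunk.size() == 0) {
            return;
        }

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(metaChunk.toByteArray());
        }

        System.out.printf("metachunk: %d bytes raw, %d bytes gzipped%n",
                metaChunk.size(), compressed.size());

        metaChunk.reset();
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}

A real implementation would of course have to persist the set of known
checksums between runs; the sketch keeps it in memory only to show the
principle.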
Here's the gist of the results -- the best configuration is:
- the variable-size chunking method TTTD (as opposed to a fixed-size
chunker)
- using the fingerprinting algorithm Adler32 (as opposed to Rabin or PLAIN)
- with a chunk size of 4 KB (as opposed to 8 KB or 16 KB)
- combined to a 250 KB or 125 KB metachunk
- and then compressed using Gzip (as opposed to Bzip2 or no compression)
- using a dynamic sleep time in between the metachunks to keep CPU usage
low (as opposed to a fixed sleep time of 30ms or no sleep time)
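For the chunking itself, here's a simplified sketch of what a variable-size
chunker with a rolling Adler-32 style fingerprint looks like. It's not TTTD
itself (TTTD additionally enforces minimum/maximum chunk sizes and uses a
backup divisor), and the window size and divisor below are just example
values, not the parameters from the thesis:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified content-defined chunker with a rolling Adler-32 style fingerprint.
// Illustrative only: real TTTD adds min/max chunk sizes and a backup divisor.
public class RollingChunker {

    private static final int MOD = 65521;       // Adler-32 modulus
    private static final int WINDOW_SIZE = 48;  // bytes in the rolling window
    private static final int DIVISOR = 4096;    // ~4 KB expected chunk size

    /** Splits data into variable-size chunks at content-defined boundaries. */
    public static List<byte[]> chunk(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();

        long a = 1, b = 0;
        int chunkStart = 0;

        for (int i = 0; i < data.length; i++) {
            int in = data[i] & 0xff;

            if (i < WINDOW_SIZE) {
                // window still filling up: plain Adler-32 update
                a = (a + in) % MOD;
                b = (b + a) % MOD;
            } else {
                // roll the window: drop the oldest byte, add the newest
                int out = data[i - WINDOW_SIZE] & 0xff;
                a = ((a - out + in) % MOD + MOD) % MOD;
                b = ((b - (long) WINDOW_SIZE * out + a - 1) % MOD + MOD) % MOD;
            }

            long fingerprint = (b << 16) | a;

            // break condition: the fingerprint hits the divisor pattern
            if (fingerprint % DIVISOR == DIVISOR - 1) {
                chunks.add(Arrays.copyOfRange(data, chunkStart, i + 1));
                chunkStart = i + 1;
            }
        }

        if (chunkStart < data.length) {
            chunks.add(Arrays.copyOfRange(data, chunkStart, data.length));
        }

        return chunks;
    }
}

Because a boundary only depends on the last few bytes before it (the rolling
window), inserting data in the middle of a file only shifts the boundaries
around the edit -- the chunks further away line up with the old ones again,
their checksums match, and they don't have to be uploaded a second time.
That's basically the answer to the re-chunking question quoted below.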
Making an API out of the chunking stuff is certainly a good idea. And as a
matter of fact, I already sort of did that. I'll publish everything
eventually. As soon as the thesis is done, I'll have some time :-D
Let me know if you'd like a peek at the thesis. I'll be glad to send you
the current version.
I hope that enlightens you a bit :-D
Cheers,
Philipp
On Tue, Jan 3, 2012 at 6:09 PM, funky dev <funkydevnull@xxxxxxxxx> wrote:
> I have a request I'd like to make as I see that the core functionality is
> now being separated from the GUI. I think it would be useful to have some
> aspects such as the file chunking/encryption algorithm separated out into a
> reusable java api. I don't believe there is an api available with this
> functionality and I believe it is rather non-trivial to chunk a file in
> such a way as to maximize re-use of previously used chunks (assuming this
> functionality is in Syncany). In fact, I would be interested to learn how
> the current code efficiently rechunks a file to minimize the amount of data
> that needs to be sent over the network. E.g., if some data is inserted in
> the middle of the file, we obviously don't want to re-upload all chunks
> after this point, but this seems non-trivial to implement. Can anyone offer
> a short explanation?
>
> thanks!
>
--
H3 11, 68159 Mannheim, Germany
Landline: +49-621-98181881 / Cell phone: +49-173-7865012 (o2 network)
Public Key: 0xBD5E122F at pgp.mit.edu