Re: deduplication, sharing, conflict resolution?

Hello Dieter,

Those are all (!!) very good questions, and in fact they are the core
things that I'll concentrate my work on in the future. I'm actually even
writing my Master's thesis about "minimizing bandwidth and disk usage for
arbitrary storage types using deduplication and meta-chunking" (working
title).

> 1) how do you handle deduplication on the storage layer and the
> networking layer? (like, if user changes 2 random bytes in a 50MB file,
> or renames a file, what kind of network traffic does this cause, and
> what are the implications on storage consumption?)
> [...]
> it seems to me that using backends like ftp are very limiting factors
> because some protocols are really dumb wrt. efficiency.
> do you upload all files in small parts (how small?) to the ftp, as to
> minimize the needed syncing for minimal changes in a big file?

At the current stage, Syncany uses a fixed-size chunking mechanism with
a configurable chunk size. In your example, if two random bytes were
changed and the chunk size was 512 KB, up to 1 MB would have to be
transferred. And it gets worse: if one byte is added at the beginning of
the file, all subsequent chunks shift and must be retransmitted. This is
of course not desirable at all, since it causes significant overhead.
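
To make that concrete, here is a simplified sketch of fixed-size
chunking (not the actual Chunker code; the class name and constants are
made up for the example):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Fixed-size chunking: chunk boundaries depend only on byte offsets,
    // so inserting one byte at the front shifts every boundary and changes
    // every subsequent chunk ID.
    public class FixedSizeChunker {
        private static final int CHUNK_SIZE = 512 * 1024; // 512 KB, configurable

        public static void chunk(String path) throws IOException, NoSuchAlgorithmException {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[CHUNK_SIZE];

            try (InputStream in = new FileInputStream(path)) {
                int read, index = 0;
                while ((read = in.read(buffer)) > 0) {
                    sha1.update(buffer, 0, read);
                    byte[] checksum = sha1.digest(); // digest() also resets the hash
                    // a real client would upload the chunk if the ID is unknown
                    System.out.printf("chunk %d: %d bytes, id=%s%n",
                            index++, read, toHex(checksum));
                }
            }
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            chunk(args[0]);
        }
    }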

In the future I'm planning to use (a) a sliding window based chunking
algorithm, e.g. based on Rabin fingerprinting, (b) with very small
chunks (8-16 KB). The algorithm I'll try first will be based on the "Two
Threshold Two Divisor" algorithm [1].

That way, as a result of (a), if a byte was added in the beginning, only
one or two chunks would change and would have to be retransmitted. As a
result of (b), those chunks would be significantly smaller than in the
current version (in your example maybe 16-32KB).
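
As a rough illustration, a content-defined chunker could look like the
sketch below (a toy byte-sum rolling hash stands in for the Rabin
fingerprint, and the constants are only examples; the real TTTD
algorithm additionally uses a second threshold and a second divisor, see
[1]):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Content-defined chunking: boundaries are chosen where a rolling hash
    // over a small window satisfies a condition, so an insertion only moves
    // the boundaries near the edit.
    public class ContentDefinedChunker {
        private static final int WINDOW = 48;           // sliding window size in bytes
        private static final int DIVISOR = 2048;        // boundary condition (illustrative)
        private static final int MIN_CHUNK = 4 * 1024;  // never emit chunks smaller than this
        private static final int MAX_CHUNK = 64 * 1024; // hard upper bound

        public static void chunk(String path) throws IOException {
            try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
                int[] window = new int[WINDOW];
                int pos = 0, sum = 0, chunkLen = 0, chunkIndex = 0, b;

                while ((b = in.read()) != -1) {
                    // roll the hash: add the new byte, drop the byte leaving the window
                    sum += b - window[pos];
                    window[pos] = b;
                    pos = (pos + 1) % WINDOW;
                    chunkLen++;

                    boolean boundary = chunkLen >= MIN_CHUNK && sum % DIVISOR == 0;
                    if (boundary || chunkLen >= MAX_CHUNK) {
                        System.out.printf("chunk %d: %d bytes%n", chunkIndex++, chunkLen);
                        chunkLen = 0;
                    }
                }
                if (chunkLen > 0) {
                    System.out.printf("chunk %d: %d bytes (tail)%n", chunkIndex, chunkLen);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            chunk(args[0]);
        }
    }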

To counteract the overhead per connection (one request per chunk), I
intend to combine chunks to meta-chunks before uploading them to the
storage.
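
One hypothetical way to do that, just to illustrate the idea (not
Syncany's actual on-disk format):

    import java.io.ByteArrayOutputStream;
    import java.util.HashMap;
    import java.util.Map;

    // Chunks are appended to a buffer; once the buffer reaches the target
    // size it is uploaded as a single file, and an index remembers in which
    // meta-chunk (and at which offset) each chunk ended up.
    public class MetaChunkWriter {
        private static final int TARGET_SIZE = 512 * 1024; // upload in ~512 KB units

        private final ByteArrayOutputStream current = new ByteArrayOutputStream();
        private final Map<String, long[]> index = new HashMap<String, long[]>(); // chunkId -> {metaChunk, offset, length}
        private long metaChunkNo = 0;

        public void addChunk(String chunkId, byte[] data) {
            index.put(chunkId, new long[] { metaChunkNo, current.size(), data.length });
            current.write(data, 0, data.length);

            if (current.size() >= TARGET_SIZE) {
                flush();
            }
        }

        public void flush() {
            if (current.size() == 0) return;
            upload("metachunk-" + metaChunkNo, current.toByteArray()); // one request for many chunks
            current.reset();
            metaChunkNo++;
        }

        private void upload(String name, byte[] data) {
            // placeholder: hand the blob to the storage plugin (FTP, S3, ...)
            System.out.printf("uploading %s (%d bytes)%n", name, data.length);
        }
    }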

> is it supported to version everything (i.e. keep x (or infinite)
> versions of all files)?

Right now, Syncany has no "cleaning" method to delete old revisions, so
it keeps all changes from the first day on. It can assemble every version
from the chunks on the remote repository.
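
Conceptually, restoring a version then just means fetching that
version's chunks in order and concatenating them (hypothetical helper,
not the actual code):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.List;

    public class VersionAssembler {
        // the version's metadata lists its chunk checksums in file order
        public static void assemble(List<String> chunkChecksums, String targetPath) throws IOException {
            FileOutputStream out = new FileOutputStream(targetPath);
            try {
                for (String checksum : chunkChecksums) {
                    byte[] chunk = fetchAndDecrypt(checksum); // download from the repository, then decrypt
                    out.write(chunk);
                }
            } finally {
                out.close();
            }
        }

        private static byte[] fetchAndDecrypt(String checksum) {
            // placeholder for the storage plugin and the decryption step
            return new byte[0];
        }
    }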

> do you encrypt small blocks of the file? because re-encrypting a file
> that has only changed a little will yield a completely different
> encrypted variant, or not?

I first chunk the file, and then encrypt the chunks that result from it.
If a small part has changed, I can detect which chunks have changed, then
encrypt and upload only those.
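
The assumed flow looks roughly like this (sketch only; the real client
derives the key from the repository password and uses a proper cipher
mode instead of plain "AES"/ECB):

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Chunks whose checksum the repository already knows are skipped;
    // only new chunks are encrypted and uploaded.
    public class ChunkUploader {
        private final Set<String> knownChecksums = new HashSet<String>();
        private final SecretKey key;

        public ChunkUploader() throws Exception {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(128);
            key = gen.generateKey(); // placeholder; normally derived from the password
        }

        public void upload(List<Chunk> chunks) throws Exception {
            for (Chunk chunk : chunks) {
                if (knownChecksums.contains(chunk.checksum)) {
                    continue; // unchanged chunk, nothing to transfer
                }
                Cipher cipher = Cipher.getInstance("AES"); // ECB only for brevity
                cipher.init(Cipher.ENCRYPT_MODE, key);
                byte[] encrypted = cipher.doFinal(chunk.data);
                send(chunk.checksum, encrypted);
                knownChecksums.add(chunk.checksum);
            }
        }

        private void send(String checksum, byte[] encrypted) {
            System.out.printf("uploading chunk %s (%d encrypted bytes)%n", checksum, encrypted.length);
        }

        public static class Chunk {
            final String checksum;
            final byte[] data;
            public Chunk(String checksum, byte[] data) { this.checksum = checksum; this.data = data; }
        }
    }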

> maybe you could use rolling checksums like rsync does (but even that is
> not ideal), 

I was using rsync's rolling checksum algorithm for a while, but
switched to Adler32 for some reason. You can look at the current chunker
code [2] and the not-yet-working TTTD chunker [3] if you like.
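
Computing a per-chunk checksum with the JDK's built-in Adler-32 is
basically a one-liner, e.g.:

    import java.util.zip.Adler32;

    public class ChecksumExample {
        public static long checksum(byte[] chunk) {
            Adler32 adler = new Adler32();
            adler.update(chunk, 0, chunk.length);
            return adler.getValue();
        }

        public static void main(String[] args) {
            System.out.println("adler32 = " + checksum("some chunk data".getBytes()));
        }
    }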

> AFAIK git actually has a pretty efficient blob storage and
> synchronisation system, you could put encrypted blobs in git's storage
> system and get some features for free or camlistore... http://camlistore.org/

The problem with Git is that it would limit Syncany to a single
protocol. Part of Syncany's goal is to support any storage out there.

I briefly looked at camlistore. I suppose we could use it as storage,
but since it's in early development as well, I think it's too early for
that.

> and what about when you keep, say 2GB in syncany (without modifying any
> file or doing anything special), will it cause an additional 2GB (or
> more) storage overhead, because it also needs to locally store the
> encryted variant of all files?

Syncany has a local cache that is only used temporarily to download
chunks and meta files. Once they are processed, they can be deleted
without causing any harm.

At the moment, the cache is never cleaned, but that'll come sooner or
later :-)

> 2) how do you handle sharing between several users? 

At the moment, Syncany assumes that all users who can access a
repository AND have the password have full access to all files.

That is, if you only have access to the repository, you can delete all
files, but not read them. If you only have the password, you could
decrypt the files, but you cannot access them. If you have both, you
have full access.

I was thinking about cryptographic access control for a while, but it
seems like a lot of effort. Maybe some time in the future.

> how about conflicts? are there means for manual and automatic conflict
> resolving? 

Syncany does conflict resolution similar to Dropbox. If two users change
the same file at the same time, it detects that and resolves the
conflict by renaming the "losing" file to "... (conflicted copy, ...)".
The winner is the client that changed the file first (currently based on
local time; later this will be vector time or Lamport time).
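
Roughly like this (illustrative sketch; the actual naming scheme may
differ slightly):

    import java.io.File;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    // The losing version is kept, but renamed, so that both copies survive
    // and the user can merge them manually.
    public class ConflictResolver {
        public static File renameLoser(File losingFile, String clientName) {
            String name = losingFile.getName();
            int dot = name.lastIndexOf('.');
            String base = dot > 0 ? name.substring(0, dot) : name;
            String ext = dot > 0 ? name.substring(dot) : "";

            String date = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
            String conflictName = base + " (" + clientName + "'s conflicted copy, " + date + ")" + ext;

            File conflictFile = new File(losingFile.getParentFile(), conflictName);
            losingFile.renameTo(conflictFile); // keep both versions side by side
            return conflictFile;
        }
    }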


I hope this helps!
If you have any suggestions, please let me know!

Cheers,
Philipp


[1] http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf
[2] http://bazaar.launchpad.net/~binwiederhier/syncany/trunk/view/head:/syncany/src/org/syncany/index/Chunker.java
[3] http://bazaar.launchpad.net/~binwiederhier/syncany/trunk/view/head:/syncany/src/org/syncany/index/TTTDChunker.java

