acmeattic-devel team mailing list archive
Message #00026
Re: Modifications to versioning system proposal
On Thu, Jul 8, 2010 at 2:17 PM, Aditya Manthramurthy
<aditya.mmy@xxxxxxxxx> wrote:
> Let me summarise the versioning system proposal so far:
>
> 1. Forward diffs are kept. So if the various versions of a file F are
> F1, F2, F3, etc, then the server stores: F1, d(F1,F2), d(F2, F3), ...
> 2. Because of storing forward diffs, the first checkout to a client
> will require a large download to get to the latest copy of a file. To
> decrease the size of the download, we could decide to store the whole file
> at a point, instead of a diff. This would look like (on the server), F1,
> d(F1,F2), d(F2,F3), F4, d(F4, F5), etc. A simple metric to decide when to
> store the whole file would be to see if the size of F1 + d(F1,F2) + d(F2,
> F3) is greater than F4.
>
This is called Snapshot. Can we use this terminology?
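The metric described in point 2 could be sketched roughly like this (the function and parameter names are mine, purely for illustration):

```python
# Hypothetical sketch of the snapshot decision metric described above:
# store a full copy of the new revision when downloading the last full
# copy plus every forward diff since it would cost more bytes than the
# new revision itself.

def should_snapshot(base_size, diff_sizes, new_full_size):
    """base_size: bytes of the last stored full copy (e.g. F1).
    diff_sizes: bytes of each forward diff stored since then.
    new_full_size: bytes of the revision being committed."""
    return base_size + sum(diff_sizes) > new_full_size
```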
>
> 1. All diffs and files are sent encrypted to the server.
> 2. Text files can use normal line based diff algorithms. Binary files
> won't generate good diffs with such algorithms. Instead we can use an
> rsync-like algorithm [1] for binary files. It will generate better diffs.
> For early releases, we can compromise to just store the full files of
> successive revisions for binary files.
>
On this note, Mercurial uses an optimized C implementation of difflib that
is used to compute diffs for both text and binary files. One of Mercurial's
strengths is that it handles text and binary storage in exactly the same
way. If we adopt this or a similar algorithm, we can probably do the same.
From the rsync reference, are you suggesting that the client cannot compute
diffs on its own, without server intervention? If you are, then the server
copy being encrypted makes our lives hard. If you are not, why the
reference to the rsync algorithm for just binary files?
I do not understand why binary files necessitate a bi-directional diff
between client and server.
On the same note, we do need a good scheme for computing diffs at the
client, with or without local copies of previous versions. I wonder how
Dropbox and SpiderOak solve this.
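For context on what the rsync algorithm [1] actually buys us: its core is a weak rolling checksum that can slide across a file in O(1) per byte, so one side can find blocks the other side already has. A rough sketch (illustrative only, names are mine, not production code):

```python
# Sketch of rsync's Adler-32-style weak rolling checksum. The receiver
# checksums fixed-size blocks of its copy; the sender slides a window
# over its file, rolling the checksum one byte at a time to spot blocks
# both sides already share.

M = 1 << 16  # both checksum halves are kept modulo 2^16

def weak_checksum(block):
    # a: plain byte sum; b: position-weighted sum over the block.
    a = sum(block) % M
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    # Slide the window right by one byte in O(1): drop out_byte and
    # take in in_byte instead of recomputing the whole block.
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b
```

The rolling property is what makes testing every byte offset cheap; a strong checksum (e.g. MD4/MD5 in rsync) is then computed only on weak-checksum matches.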
>
> 1. All this does seem like we need a custom version management module,
> instead of reusing an existing one like hg, git or bzr. But I think that
> performance- and flexibility-wise, we are better off writing our own. We
> can always borrow code from those projects.
> 2. The number of versions to store, etc., is still not decided. We could
> perhaps discuss this more. Issues related to this:
> 1. The server sees only encrypted data. So it will not be able to
> process diff files to compact, etc, at least with the current encryption
> scheme of using AES.
>
By compaction, I am assuming you are referring to the periodic procedure
that forgets old revisions. "Compaction" somehow sounds to me like the
compression (zip, bzip2) that VCSs perform.
>
> 1. Clients could do compaction, before sending versions to the server.
> 2. Clients can also tell server to discard certain older versions,
> when it decides to store a full file.
>
Is this the compaction you are referring to, or is it the Snapshot feature?
Both interpretations need a better discussion.
Compaction: this is done at a later stage, when some intermediate revisions
need to be 'forgotten'. The changes made in those revisions then need to be
merged into later revisions. The challenge is that the server cannot handle
this alone without reading the diffs, and involving the client necessitates
network communication.
Snapshotting: this is done on an incremental basis. When the decision to
store a snapshot is made, the server stores the complete file. None of the
older revisions need to be dropped (in fact, dropping them would be wrong).
Future diffs are stored the usual way. The only advantage is that a
checkout no longer has to replay diffs all the way from revision 1.
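To make that last point concrete, a checkout only ever has to walk back to the nearest full copy. A toy sketch (the store layout and the trivial patch format here are invented purely for illustration):

```python
# Toy sketch: `store` is a list with one entry per revision, either
# ('full', data) for a snapshot or ('diff', patch) for a forward diff.
# The patch format is deliberately trivial: (keep, suffix) means
# "keep the first `keep` bytes of the base, then append `suffix`".

def apply_patch(base, patch):
    keep, suffix = patch
    return base[:keep] + suffix

def checkout(store, rev):
    # Walk back to the nearest snapshot, then replay forward diffs;
    # with no snapshots this degenerates to replaying from revision 1.
    start = max(i for i in range(rev + 1) if store[i][0] == 'full')
    data = store[start][1]
    for i in range(start + 1, rev + 1):
        data = apply_patch(data, store[i][1])
    return data
```

Note that checking out any revision at or after a snapshot never touches the older diffs, which is exactly why none of them need to be dropped.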
>
> 1. The frequency of storing revisions should be flexible later on, but
> for quick early release we can compromise.
>
> Hope a better discussion can be had now. Please reply inline if you are
> going to respond to each point individually.
>
This brings me to the central confusion:
   1. Perform complete encryption at the client. The server cannot read
   the diffs. Problem: forgetting/merging intermediate revisions needs to
   be done at the client, which involves network traffic.
   2. Diff-visible encryption. Data is still encrypted at the client, but
   the server can gather enough knowledge to perform revision merges on
   its own. (I am going to try exploring this area in the next few days;
   others are welcome too.)
   3. Or, find a really good algorithm to minimize network traffic for
   revision merging.
> --
> Aditya.
>
> [1]: http://samba.anu.edu.au/rsync/tech_report/
>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~acmeattic-devel
> Post to : acmeattic-devel@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~acmeattic-devel
> More help : https://help.launchpad.net/ListHelp
>
>
--
Karthik