
acmeattic-devel team mailing list archive

Re: Modifications to versioning system proposal

 

On 10 July 2010 01:11, Karthik Swaminathan Nagaraj <nkarthiks@xxxxxxxxx> wrote:

>
>
> On Thu, Jul 8, 2010 at 2:17 PM, Aditya Manthramurthy
> <aditya.mmy@xxxxxxxxx> wrote:
>
>>  Let me summarise the versioning system proposal so far:
>>
>>    1. Forward diffs are kept. So if the various versions of a file F are
>>    F1, F2, F3, etc., then the server stores: F1, d(F1,F2), d(F2,F3), ...
>>    2. Because of storing forward diffs, the first checkout to a client
>>    will require a large download to get to the latest copy of a file. To
>>    decrease the size of the download, we could decide to store the whole
>>    file at some point, instead of a diff. This would look like (on the
>>    server): F1, d(F1,F2), d(F2,F3), F4, d(F4,F5), etc. A simple metric
>>    to decide when to store the whole file would be to check whether
>>    size(F1) + size(d(F1,F2)) + size(d(F2,F3)) is greater than size(F4).
>>
>> This is called Snapshot. Can we use this terminology?
>

I think the high-level feature can be called snapshotting, but at the
implementation level we need to distinguish between a diff revision
and a full-copy revision.
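
A minimal sketch of that size-based decision rule from point 2, in Python
(the name should_snapshot is mine, not settled terminology): store a
full-copy revision whenever rebuilding the latest version from the stored
chain would cost more bytes than the file itself.

    def should_snapshot(chain_sizes, new_full_size):
        """chain_sizes: byte sizes of the last full copy plus every diff
        after it, e.g. [size(F1), size(d(F1,F2)), size(d(F2,F3))].
        Store the next revision in full when downloading the whole chain
        costs more than downloading the file itself."""
        return sum(chain_sizes) > new_full_size

    # e.g. a first checkout needing F1 plus two diffs totalling 1.4 MB,
    # while the latest file is only 1.0 MB, should trigger a snapshot:
    assert should_snapshot([1000000, 250000, 150000], 1000000)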


>
>>    3. All diffs and files are sent encrypted to the server.
>>    4. Text files can use normal line-based diff algorithms. Binary
>>    files won't generate good diffs with such algorithms. Instead, we
>>    can use an rsync-like algorithm [1] for binary files, which will
>>    generate better diffs. For early releases, we can compromise and
>>    just store the full files of successive revisions for binary files.
>>
> On this note, Mercurial uses an optimized C implementation of difflib,
> which is used to compute diffs for both text and binary files. One of
> Mercurial's strengths is that it handles text and binary storage in
> exactly the same way. If we adopt this or a similar algorithm, we can
> probably do the same.
>

Here is some info about how Mercurial handles binary files:
http://mercurial.selenic.com/wiki/BinaryFiles It does not really say
anything about how the diff for binary files is generated; indeed, it seems
to say that binary files are not processed by the mercurial diff
sub-command at all. How about giving some links to the Mercurial diff
algorithm used for file storage? I couldn't find much myself.


> From the rsync algorithm reference, are you suggesting that we do not
> compute the diffs at the client without server intervention? If so, the
> fact that the server copy is encrypted makes our lives hard. If not, why
> the reference to this rsync algorithm for just binary files?
>

No, I didn't mean that the rsync algorithm would be used directly without
any modification, and we will not be using diffs of files between server
and client. The client machine keeps a copy of the pristine file, and uses
that to generate a diff once the file is modified. I was just saying that
the rsync algorithm can be used to find a good block-based diff between
two binary files. We can use the Mercurial way (whatever that is) if it
actually produces a diff rather than keeping a copy of the file.
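
For concreteness, here is a toy, unoptimized sketch of that block-matching
idea (the names are mine, and it recomputes a strong hash at every byte
offset, where real rsync uses a cheap rolling checksum precisely to avoid
that):

    import hashlib

    BLOCK = 64  # block size; real implementations tune this

    def block_diff(old: bytes, new: bytes):
        """Return ops that rebuild `new` from `old`: ('copy', i) reuses
        the i-th BLOCK-sized chunk of `old`, ('data', b) inserts bytes."""
        # Index every BLOCK-sized chunk of the pristine file by its hash.
        index = {}
        for i in range(0, len(old), BLOCK):
            chunk = old[i:i + BLOCK]
            index.setdefault(hashlib.md5(chunk).digest(), i // BLOCK)

        ops, literal, pos = [], bytearray(), 0
        while pos < len(new):
            chunk = new[pos:pos + BLOCK]
            block = (index.get(hashlib.md5(chunk).digest())
                     if len(chunk) == BLOCK else None)
            if block is not None:
                if literal:  # flush pending literal bytes first
                    ops.append(('data', bytes(literal)))
                    literal = bytearray()
                ops.append(('copy', block))
                pos += BLOCK
            else:  # no block match at this offset; emit a literal byte
                literal.append(new[pos])
                pos += 1
        if literal:
            ops.append(('data', bytes(literal)))
        return ops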


> I do not understand why binary files necessitate the use of a
> bi-directional diff between client and server.
> On the same note, we do need a good scheme to handle diffs at the client,
> with or without pristine file copies. I wonder how DropBox and SpiderOak
> solve this.
>

Keeping a copy ought to be the approach for the earlier releases. I am
getting worried that we are making too many plans for features before
taking any concrete steps towards a release plan. We should get off the
ground soon. I think a sprint-based approach would be good to get things
started.


>
>>    5. All this does seem like we need a custom version management
>>    module, instead of reusing one like hg, git or bzr. But I think that
>>    performance- and flexibility-wise, we are better off writing our own.
>>    We can always borrow code from this other software.
>>    6. The number of versions, etc. to store is still not decided. We
>>    could perhaps discuss this more. Issues related to this:
>>       1. The server sees only encrypted data, so it will not be able to
>>       process diff files for compaction, etc., at least with the current
>>       encryption scheme of using AES.
>>
> By compaction, I am assuming you are referring to the periodic procedure
> that forgets revisions. Compaction somehow sounds like compression to me,
> which is what VCSs do (zip, bzip).
>

That's right. I mean keeping a compact history of a file, without keeping
it in too much detail.

>
>>       2. Clients could do compaction before sending versions to the
>>       server.
>>       3. Clients can also tell the server to discard certain older
>>       versions, when it decides to store a full file.
>>
> Is this the compaction you are referring to, or is it the Snapshot
> feature? Both interpretations need a better discussion.
> Compaction: This is done at a later stage, when some intermediate revisions
> need to be 'forgotten'. This means that the changes made in those revisions
> need to be merged into future revisions. The challenge is that the server
> cannot handle this alone without reading the diffs. Client involvement
> necessitates network communication.
>
> Snapshotting: This is done on an incremental basis. Hence, when the
> decision to store a snapshot is made, the server stores the complete file.
> None of the older revisions need to be dropped (in fact, dropping them
> would be wrong). Future diff storage just happens the usual way. The only
> advantage is that we do not have to retrace all the way back to revision 1.
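
To make the compaction case concrete, here is a toy Python illustration
(assuming the client still has, or can reconstruct, the plaintext
revisions): forgetting F2 amounts to replacing d(F1,F2) and d(F2,F3) with
a single d(F1,F3), which only the client can compute.

    import difflib

    f1 = ["line one\n", "line two\n"]
    f2 = ["line one\n", "line two v2\n"]  # the revision to forget
    f3 = ["line one\n", "line two v3\n", "line three\n"]

    # The client recomputes a direct diff that skips F2 entirely...
    d13 = list(difflib.unified_diff(f1, f3, "F1", "F3"))

    # ...encrypts it, and asks the server to swap its two stored
    # (encrypted) diffs for this single one. The server cannot do this
    # merge itself, because it only ever sees ciphertext.
    print("".join(d13))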
>
>>
>>       4. The frequency of storing revisions should be flexible later
>>       on, but for a quick early release we can compromise.
>>
>> Hope a better discussion can be had now. Please reply inline if you are
>> going to respond to each point individually.
>>
>
> This brings me to the points of confusion:
>
>    1. Perform complete encryption at the client. The server cannot read
>    the diffs. Problem: forgetting/merging intermediate revisions needs to
>    be done at the client, which involves network traffic.
>
I think this can be done in a way that does not use too much network
traffic. My idea is based on the assumption that the revision schedule for
the files is already known in advance, e.g., 7 daily revisions, 4 weekly,
12 monthly, and 5 yearly. Suppose we are taking daily revisions: we should
be able to decide which versions are useless after a point and delete/merge
them with limited client traffic. I will think about it more and make a
post later.
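
As a rough sketch of what that schedule-driven decision could look like
(the function and variable names here are mine, purely illustrative):

    from datetime import datetime, timedelta

    def revisions_to_keep(timestamps, now):
        """timestamps: revision times, newest first. Keep the newest
        revision in each retention window of a 7-daily / 4-weekly /
        12-monthly / 5-yearly schedule; everything else becomes a
        candidate for deletion/merging."""
        keep = set()
        buckets = [
            (timedelta(days=1), 7),    # 7 daily revisions
            (timedelta(weeks=1), 4),   # 4 weekly
            (timedelta(days=30), 12),  # 12 monthly (approximated)
            (timedelta(days=365), 5),  # 5 yearly
        ]
        for interval, count in buckets:
            for n in range(count):
                lo, hi = now - interval * (n + 1), now - interval * n
                for ts in timestamps:
                    if lo <= ts < hi:  # newest revision in this window
                        keep.add(ts)
                        break
        return keep

    # e.g. six-hourly revisions over ~16 months collapse to a couple of
    # dozen kept revisions; the rest can be deleted/merged locally.
    now = datetime(2010, 7, 10)
    revs = [now - timedelta(hours=6 * i) for i in range(2000)]
    print(len(revisions_to_keep(revs, now)))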

>
>    2. Diff-visible encryption. Data is still encrypted at the client,
>    but the server can gather enough knowledge to perform revision merges
>    on its own. (I am going to try exploring this area in the next few
>    days - others are welcome too.)
>
I don't think this will work with good encryption algorithms, but it's only
a feeling. Bharath also tried looking at something like this.

>
>    3. Or, find a really good algorithm to minimize network traffic for
>    revision merging.
>
>
Finally, this is probably going to be the way to go.

--
Aditya.

References