syncany-team team mailing list archive

Re: Bigger database issues

To: Fabrice Rossi <Fabrice.Rossi@xxxxxxxxxxx>
From: Philipp Heckel <philipp.heckel@xxxxxxxxx>
Date: Tue, 10 Dec 2013 00:11:33 +0100
Cc: Syncany Mailing List <syncany-team@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <52A5F353.1050602@apiacoa.org>
Hello again,

> Looks great, but something is off with the last commit as it does not
> compile anymore. I think ObjectId has disappeared somehow and getChecksum
> is also missing. I've not tracked the commits in detail as I think it's
> better if you fix it on your side, being the last committer.
>

I forgot to add the ObjectId class ... I moved it to the database package
to reduce cross package dependencies.

> The design of the second version again shows a lot of effort to solve
> a _very_ complex problem, this version control thing without a server.
> Based on what I've read on distributed systems (and my work on machine
> learning on the cloud), syncany is trying to address possibly one of the
> most difficult situations in distributed systems: no central coordination
> but also no guarantee that clients will be online at the same time.
>

True, it is indeed very complex. That's why the exploratory prototyping is
necessary ...


> (...) As far as I know, Benjamin Pierce's Harmony is one
> of the few systems with related (but not similar) constraints. (Pierce
> is Unison's mastermind.) So in this situation, trial and error is
> needed, but this applies to theory as well and this is not dumb trial
> and error, it's experimental work!
>

I remember reading the Unison/Pierce papers for my master's thesis... Vaguely
:-)

(..)
> > Maybe this can be extended (if we need to).
>
> I think so, as it does not seem detailed enough in my opinion. I'll give
> it a try.
>

Cool.


> I'm not a big fan of everything I tried, but it was years ago. Recently,
> I've been more and more attracted to text-based solutions, which is why
> I will try PlantUML to contribute a data model
> (http://plantuml.sourceforge.net/). But there are probably better
> solutions.
>

Looks pretty cool, even with Eclipse/JavaDoc integration:
www.youtube.com/watch?v=hd2dG6Xvn58


> I'm not sure, maybe something less formal, like a short description (in
> English) of what data are needed during each step of a command?
>

I've started to draft a wiki page with my two cents on the topic. My main
interest right now is defining which SELECT statements / getters will be
required in a SQL-based DAO. Most of it is copied from the JavaDoc; the
only new part is the use case section, and that's just vague bullet points
...

https://github.com/binwiederhier/syncany/wiki/Data-model

Feel free to edit/add your content!!


> I was not very clear, so let me rephrase. The current data model is
> based on some unwritten hypotheses that might lead to some difficulties
> (or not!), and as persisting this model into a database is a difficult
> task (whatever the technology you use), I think it would be nice to
> review the model and the associated hypotheses beforehand (or at least
> in parallel, I agree that having the implementation, even partial,
> sometimes helps a lot to understand things ;-).
>

Agreed. I think the wiki page above can be a starting point to do that.


> For instance, Syncany tracks
> file renames but git does not (and Linus is quite vocal about that). This
> means that when the tracking goes ok, syncany explicitly stores a
> (Partial)FileHistory with sometimes redundant things (like the same file
> path many times). What are the consequences of this choice in terms
> of requests, caches, memory and disk usage, etc.?
>

This might be something worth analyzing, but I don't see any big issues
here. Maybe it's still good to review it. Feel free to do so and ask/write
down what you think.

<snip: "rename" example explanation>
Even though it's just an example, I'm still going to answer :-)
For anyone who's interested:
https://git.wiki.kernel.org/index.php/Git_FAQ#Why_does_Git_not_.22track.22_renames.3F

The rename tracking in Syncany is just a best guess and does not have much
influence on the file synchronization in the "down" operation: While a
PartialFileHistory might contain multiple file versions (indicating a
rename), the reconciliation is done by comparing (1) the
local file system with (2) the local database and (3) the winning database
(three-way comparison).

Example:
<fileHistory id="82669b51966546d7c1e1a1da7d9b515b3bd6c831">
  <fileVersions>
    <fileVersion version="1" status="NEW" path="file1.jpg" checksum="b7ba6..." ... />
    <fileVersion version="2" status="RENAMED" path="file-RENAMED.jpg" checksum="b7ba6..." ... />
  ...

In the "up" (indexing) operation, Syncany has determined that the file was
renamed (because file1.jpg was missing, and file-RENAMED.jpg was there with
the same checksum). In the "down" operation (reconcile), Syncany only
compares what it expects (2) with what is actually there (1), and what
should be there according to the downloaded updates (3). From that, a file
system action is determined. Assuming that locally, only version 1 is known
(and version 2 is new), "file1.jpg" is renamed to "file-RENAMED.jpg"
because Syncany expects "file1.jpg" to be there (if file system and local
db are in sync; (1) and (2)) and knows that the new file version has a
different path, but the same checksum...
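
The rename guess made during "up" (a known path is missing, and a new path
shows up with the same checksum) could be sketched roughly as follows. This
is only an illustration under simplifying assumptions: the class and method
names are hypothetical and not taken from the Syncany code base, and files
are reduced to plain path -> checksum maps.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a best-guess rename detection; not Syncany's indexer. */
final class RenameGuesser {
    /**
     * @param known  path -> checksum according to the local database
     * @param onDisk path -> checksum as currently found on disk
     * @return old path -> new path for every guessed rename
     */
    static Map<String, String> guessRenames(Map<String, String> known,
                                            Map<String, String> onDisk) {
        // Index files that are on disk but unknown to the database by checksum.
        Map<String, String> newPathByChecksum = new HashMap<>();
        for (Map.Entry<String, String> e : onDisk.entrySet()) {
            if (!known.containsKey(e.getKey())) {
                newPathByChecksum.putIfAbsent(e.getValue(), e.getKey());
            }
        }

        Map<String, String> renames = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : known.entrySet()) {
            if (onDisk.containsKey(e.getKey())) {
                continue; // path still exists, so no rename to guess
            }
            // Known path is missing: look for a new path with the same checksum.
            // remove() makes sure each new file is matched at most once.
            String newPath = newPathByChecksum.remove(e.getValue());
            if (newPath != null) {
                renames.put(e.getKey(), newPath);
            }
        }
        return renames;
    }
}
```

If several new files share a checksum, the sketch simply picks one of them,
which is in line with the "best guess" nature of the tracking.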
</snip: "rename" example explanation>
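
To make the three-way comparison in "down" a bit more concrete, here is a
minimal sketch of how a file system action could be derived from (1) the
local file system, (2) the local database and (3) the winning database. All
names are made up for illustration and are not Syncany's actual classes.
The CONFLICT branch in particular is an assumption of mine; deletions and
new files are left out entirely.

```java
import java.util.Objects;

/** Hypothetical sketch of the three-way decision; names are illustrative only. */
final class ReconcileSketch {
    enum Action { NOTHING, RENAME, DOWNLOAD, CONFLICT }

    /** Minimal stand-in for a file version: just a path and a checksum. */
    static final class Version {
        final String path;
        final String checksum;

        Version(String path, String checksum) {
            this.path = path;
            this.checksum = checksum;
        }
    }

    /**
     * @param diskChecksum checksum of the file found at expected.path,
     *                     or null if the file is missing (1)
     * @param expected     last version known to the local database (2)
     * @param winning      latest version in the winning database (3)
     */
    static Action decide(String diskChecksum, Version expected, Version winning) {
        // (1) vs (2): are file system and local database in sync?
        boolean inSync = Objects.equals(diskChecksum, expected.checksum);
        // (2) vs (3): what did the downloaded updates change?
        boolean samePath = expected.path.equals(winning.path);
        boolean sameContent = expected.checksum.equals(winning.checksum);

        if (samePath && sameContent) {
            return Action.NOTHING;  // the updates do not touch this file
        }
        if (!inSync) {
            return Action.CONFLICT; // assumption: unexpected local state is a conflict
        }
        if (sameContent) {
            return Action.RENAME;   // same checksum, different path
        }
        return Action.DOWNLOAD;     // content changed
    }
}
```

For the example above (locally only version 1 is known, file system and
local db in sync, version 2 has a new path but the same checksum), decide()
returns RENAME.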


> > - I'll also try to do a first prototype of the cleanup operation to
> > delete old metadata (purpose: (1) reduce memory footprint for now,
> > (2) insight gathering about cleanup)
>
> Great! I have some basic ideas for a locking API for plugins. I've tried to
> design a lock-free version of the cleanup, but nothing stands up to
> scrutiny...
>

I started to implement the cleanup with the current DB representation and I
immediately remembered why we decided to go for a SQL database. The list
walking is inefficient and impossible to understand. Don't look, the code
is incredibly ugly :-)

Best,
Philipp
References

Bigger database issues
From: Philipp Heckel, 2013-12-07
Re: Bigger database issues
From: Fabrice Rossi, 2013-12-08
Re: Bigger database issues
From: Philipp Heckel, 2013-12-09
Re: Bigger database issues
From: Fabrice Rossi, 2013-12-09