
syncany-team team mailing list archive

Re: Bigger database issues

 

Hi,

On 09/12/2013 01:10, Philipp Heckel wrote:
> [discussion style...] That is true. It's probably good to have 
> different views and different approaches to these kinds of issues. 
> Otherwise I probably wouldn't even have seen the ID topic as an 
> issue. But we'll all profit from this discussion. I hope I haven't 
> offended you!

Not at all! You need to keep the project on track, and you have to be
clear on the priorities. And you don't do that in Linus' style, so that's
perfectly OK ;-)

> Gregor also told me that he found the e-mail discussions very 
> useful, so please keep discussing :-)

Will do :-)

>> [..] Ok. I've pushed some additional modifications in the line of 
>> FileId to my branch (longer-file-id), but it's not based on 
>> Gregor's design.
> I definitely see that now, and I think that it's really easy to 
> implement -- even though it will touch many different files. I just 
> took your code and merged it into the master, and then started to 
> eliminate the ByteArray and byte[] based IDs as far as I could.

Looks great, but something is off with the last commit as it does not
compile anymore. I think ObjectId has disappeared somehow and getChecksum
is also missing. I've not tracked the commits in detail as I think it's
better if you fix it on your side, being the last committer.
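
For what it's worth, here is roughly what I picture when we talk about
replacing the byte[]-based IDs with a typed wrapper. The names ObjectId
and FileId come from the branch; everything else below is only my guess,
not the actual code:

import java.util.Arrays;

// Sketch of a typed ID wrapper replacing raw byte[] identifiers.
// Only the class names come from the branch; the rest is guesswork.
public abstract class ObjectId {
    private final byte[] bytes;

    protected ObjectId(byte[] bytes) {
        this.bytes = Arrays.copyOf(bytes, bytes.length); // defensive copy
    }

    public byte[] getBytes() {
        return Arrays.copyOf(bytes, bytes.length);
    }

    @Override
    public boolean equals(Object other) {
        return other != null && getClass() == other.getClass()
            && Arrays.equals(bytes, ((ObjectId) other).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() { // hex representation, handy for logs
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}

// One concrete type per entity, so the compiler catches mixed-up IDs.
class FileId extends ObjectId {
    public FileId(byte[] bytes) {
        super(bytes);
    }
}

The nice side effect is that a FileId can no longer be passed where, say,
a chunk checksum is expected, which raw byte[] arrays allow silently.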

> [..] What do you think? A bit better?

I think so ;-)

>> [theory and practice ;-)..]
> If you put it that way, it really sounds like trial-and-error, 
> doesn't it? :-)

Ah, sorry about that, my turn to apologize (no offense intended in the
first place). What I meant is that you did a lot of research and design
work on the chunking part, which means that you probably did not have
time during your master's to be as thorough on the version control part.
Hence the « mistakes » in the first version.

The design of the second version again shows a lot of effort to solve a
_very_ complex problem, this version control thing without a server.
Based on what I've read on distributed systems (and my work on machine
learning in the cloud), Syncany is trying to address possibly one of the
most difficult situations in distributed systems: no central coordination,
but also no guarantee that clients will be online at the same time. This
is really tough because you can neither assume clients will answer a
broadcast in reasonable time (contrary to classical distributed systems,
in which such hypotheses are made in a limited form) nor assume that you
are alone uploading some files. As far as I know, Benjamin Pierce's
Harmony is one of the only systems with related (but not identical)
constraints. (Pierce is Unison's mastermind.) So in this situation, trial
and error is needed, but this applies to theory as well, and it's not
dumb trial and error, it's experimental work!

> You are probably really, really right. So far, I've discussed the 
> software architecture with Steffen a lot and if we thought something 
> made sense, we've started to implement it (just because that's more 
> fun than writing documents). Most of the time, you realize pretty 
> quickly if something works, or if it doesn't.

Agreed. I tend to overthink things, ending up with, well, nothing except
documents ;-). In France we say "le mieux est l'ennemi du bien" (from
Voltaire), which translates to "the best is the enemy of the good". I
have to remind myself of this every morning to get things done :-D

> But maybe it's better to write/draw something first to find the best 
> data representation.

Let's say a _good_ data representation :-D

>> I think we first need an entity-relationship model of the data. For
>> instance, we have a Chunk entity and a MultiChunk entity, with an
>> "is made of" relation, etc. It would be way simpler to reason about 
>> such a model than about a bunch of classes.
> I'm not much of a formal model-person, but I'd say this is more of a 
> class diagram than an ER diagram. Although in this case I wouldn't 
> see much of a difference: 
> https://raw.github.com/binwiederhier/syncany/46a6ae1ba2c3cf3b5cdcb81f445a77eda5455131/docs/Diagram%20Database.png

Yes. UML-like class diagrams can be used to represent an ER model. They
tend to deemphasize the relations between entities, which is why I prefer
old-school ER, but that's not a big deal.

> Maybe this can be extended (if we need to).

I think so, as it does not seem detailed enough to me. I'll give it a
try.
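
To give an idea of the level of detail I have in mind, here is the
Chunk/MultiChunk "is made of" relation written down as plain classes
(the two entity names come from our discussion, the attributes are pure
guesses on my side):

import java.util.ArrayList;
import java.util.List;

// Sketch of the "is made of" relation between MultiChunk and Chunk.
// Attribute names are invented; only the relation itself is from the discussion.
class Chunk {
    private final byte[] checksum; // identifies the chunk content
    private final int size;        // size in bytes

    Chunk(byte[] checksum, int size) {
        this.checksum = checksum;
        this.size = size;
    }

    byte[] getChecksum() {
        return checksum;
    }
}

class MultiChunk {
    private final byte[] id;
    private final List<Chunk> chunks = new ArrayList<>(); // "is made of"

    MultiChunk(byte[] id) {
        this.id = id;
    }

    void addChunk(Chunk chunk) {
        chunks.add(chunk);
    }
}

An ER diagram would add the cardinalities and the other entities around
this, which is exactly the part I find missing right now.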

> (Side note: Do you know any good open source cross-platform modelling
> tools? At work I use Enterprise Architect, but that's far from open
> source.)

I'm not a big fan of anything I tried, but that was years ago. Recently,
I've been more and more attracted to text-based solutions, which is why
I will try PlantUML (http://plantuml.sourceforge.net/) to contribute to
the data model. But there are probably better solutions.

> [..] I'll try to draw/write down something! Do you have any
> suggestions in terms of format? use cases + activity diagrams?

I'm not sure, maybe something less formal, like a short description (in
English) of what data are needed during each step of a command?

>> [..] Without all of this, I think we are stuck with the current
>> very complex Java representation of the data. This representation
>> was needed to build a working version of syncany and I'm truly 
>> impressed by the result. Now that it works, it's time to sort 
>> things out without trying first to optimize storing this 
>> representation as if it were dictated by the data.
> 
> Yes and no. I agree that it's important to get a better idea of what 
> view on the data model we need (~ which 'select statements' or 
> 'getter methods' should perform well). However, I do not agree that 
> trying to persist the existing data model is necessarily a bad thing
>  -- and it's certainly not just an optimization. Because once we have
>  the data model in a SQL-based database, we can easily create the 
> views we need for the above mentioned scenarios.

I was not very clear, so let me rephrase. The current data model is
based on some unwritten hypotheses that might lead to some difficulties
(or not!), and as persisting this model into a database is a difficult
task (whatever the technology you use), I think it would be nice to
review the model and the associated hypotheses beforehand (or at least
in parallel; I agree that having the implementation, even partial,
sometimes helps a lot to understand things ;-). For instance, Syncany
tracks file renames but git does not (and Linus is quite vocal about
that). This means that when the tracking works, Syncany explicitly
stores a (Partial)FileHistory with sometimes redundant things (like the
same file path many times). What are the consequences of this choice in
terms of requests, caches, memory and disk usage, etc.?

To put it differently, I think that while the high-level design of the
database is fixed, it is compatible with several lower-level designs, and
maybe it's time to validate/invalidate possible options.
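
To make the redundancy point concrete, this is how I read the current
representation (PartialFileHistory is the name from the code we
discussed; the FileVersion name and the attributes are my simplified
guesses):

import java.util.ArrayList;
import java.util.List;

// Every version carries the full path, so a file that is never renamed
// still repeats the same path once per version.
class FileVersion {
    long version;
    String path;     // full path, duplicated across versions if unchanged
    byte[] checksum; // content checksum of this version
}

class PartialFileHistory {
    byte[] fileId;
    List<FileVersion> versions = new ArrayList<>();
}

Whether this duplication is acceptable once it lives in a database, or
whether paths should be stored once and referenced, is exactly the kind
of lower-level option I have in mind.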

> Next steps: - I'll try to draw/draft some scenarios and corresponding
> diagrams; if you want, feel free to suggest/help!

I will!

> - I'll also try to do a first prototype of the cleanup operation to 
> delete old metadata (purpose: (1) reduce memory footprint for now, 
> (2) insight gathering about cleanup)

Great! I have some basic ideas for a locking API for plugins. I've tried
to design a lock-free version of the cleanup, but nothing stands up to
scrutiny...
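
To sketch what I mean by a locking API (everything below is hypothetical,
just a starting point for discussion):

// Hypothetical interface a storage plugin could implement so that a
// cleanup run cannot purge metadata concurrently with another client.
public interface RepositoryLock {

    /**
     * Try to acquire an exclusive lock on the repository, for instance
     * by uploading a lock file. Returns false if another client holds
     * it and the timeout expires.
     */
    boolean tryAcquire(String clientName, long timeoutMillis) throws Exception;

    /** Release the lock, for instance by deleting the lock file. */
    void release() throws Exception;

    /** Tell whether a non-expired lock is currently present. */
    boolean isLocked() throws Exception;
}

A cleanup would then be wrapped in tryAcquire()/release(), and a lock
older than some expiry time could be treated as stale to cope with
clients that crash while holding it.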

> - Gregor will play around with the JPA stuff and try to persist/load
> to a relational database (purpose: also for prototyping+exploring
> performance and compatibility with the data model)

I'll be watching that with interest, but my knowledge is too limited in
this area to really help.
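
From the little I know about JPA, the prototype would boil down to
something like this (the entity, the field names and the persistence
unit name are pure guesses on my side):

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

// A minimal JPA round trip, only to picture what "persist/load to a
// relational database" could look like. Names are invented for the example.
@Entity
class ChunkEntity {
    @Id
    private String checksum; // hex-encoded chunk checksum as primary key
    private int size;

    protected ChunkEntity() {} // no-arg constructor required by JPA

    ChunkEntity(String checksum, int size) {
        this.checksum = checksum;
        this.size = size;
    }

    int getSize() {
        return size;
    }
}

public class JpaPrototype {
    public static void main(String[] args) {
        // "syncany" would be a persistence unit defined in persistence.xml
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("syncany");
        EntityManager em = emf.createEntityManager();

        em.getTransaction().begin();
        em.persist(new ChunkEntity("ab12cd34", 512 * 1024)); // store one entry
        em.getTransaction().commit();

        // load it back by primary key
        ChunkEntity loaded = em.find(ChunkEntity.class, "ab12cd34");
        System.out.println("Loaded chunk of size " + loaded.getSize());

        em.close();
        emf.close();
    }
}

Even a round trip like this on the real data model should already tell a
lot about performance and compatibility.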

> - I'd like to move all branches to Gradle by the end of the
> week, so if you haven't done any testing there, please do :-)

Last time I tried, it worked perfectly.

Cheers,

Fabrice


