syncany-team team mailing list archive

Re: Bigger database issues

To: Philipp Heckel <philipp.heckel@xxxxxxxxx>
From: Fabrice Rossi <Fabrice.Rossi@xxxxxxxxxxx>
Date: Fri, 13 Dec 2013 17:18:50 +0100
Cc: Gregor Trefs <gregor.trefs@xxxxxxxxx>, Syncany Mailing List <syncany-team@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAAvm79biY2aR_f=8zzLwbtkPjngHpMu4WV+cn3aWkoDLZWN4RQ@mail.gmail.com>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hi,

On 12/12/2013 20:34, Philipp Heckel wrote:
> One minor adjustment: The association between MultiChunkEntry and 
> ChunkEntry is a n:m (that's also wrong in my diagram), because it
> can happen that two clients index the same file at the same time,
> thereby adding the same chunks to a different multichunk.

Ah, I overlooked this one!
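
To make the n:m point concrete, here is a minimal sketch (Python, not
Syncany code; the multichunk ids and checksums are made up) of how the
same chunk ends up in two multichunks when two clients index the same
file independently:

```python
from collections import defaultdict

# Hypothetical association table: (multichunk_id, chunk_checksum) pairs.
# Clients A and B indexed the same file at the same time and each packed
# the resulting chunks into a multichunk of its own.
multichunk_chunks = [
    ("mc-A1", "c1"), ("mc-A1", "c2"),   # uploaded by client A
    ("mc-B1", "c1"), ("mc-B1", "c2"),   # uploaded by client B
    ("mc-B1", "c3"),                    # chunk only client B produced
]

chunks_of = defaultdict(set)       # multichunk -> chunks it contains
multichunks_of = defaultdict(set)  # chunk -> multichunks containing it

for multichunk_id, checksum in multichunk_chunks:
    chunks_of[multichunk_id].add(checksum)
    multichunks_of[checksum].add(multichunk_id)

# A chunk can live in several multichunks, and a multichunk holds several
# chunks, so the association has to be modelled as n:m.
print(sorted(multichunks_of["c1"]))
```

In a relational schema this would typically become a separate
association table rather than a foreign key on either entity.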

> (...) but I like being Captain Obvious.
>> 
> 
> Well, Captain Obvious, ... It's only obvious after someone
> points out the elephant in the room. Related :-) : 
> http://www.youtube.com/watch?v=Ahg6qcgoay4

Nice one :-)

>> (...) In the current code this is handled by a full database cache
>>  which induces more or less a full duplication of the database.
>> 
> Like most times, this is EXACTLY what's wrong with the current 
> database representation -- and one of the reasons why I thought a SQL
> backend would be a lot simpler to handle.

Indeed.

> In my naive mind, I thought of storing the clean data model (as per
> your diagram) in the SQL database, and querying the current state
> when needed.

Sounds good to me, up to some details ;-)

> Something in the sense of
> 
> select fv.* from databaseversions dbv join filehistories fh on ... 
> join fileversions fv on ... where dbv.date < '2013-12-12 18:10:00' 
> and fv.version = (select max(version) from fileversions fv2 where 
> fv.id = fv2.id)
> 
> In fact, that's what I was trying to do when I started the "Data 
> model" wiki page, trying to identify which database views we need and
> derive the SELECT statements ...

I'm under the impression that there might be subtleties related to the
way conflicts are handled in the down phase which could prevent this
type of query from working. More precisely, can you guarantee that
version numbers are strictly increasing for each file? It seems possible
but I'm a bit frightened by the complexity of the reconciliation code
and thus not entirely sure of its theoretical properties.
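
A minimal sketch of both the query and the property it relies on, using
Python's sqlite3 module. Everything here is an assumption made for
illustration: the three tables are flattened into a single hypothetical
fileversions table (the join conditions are elided in the quoted query),
and the column names are invented. Note one deliberate difference: the
date filter is repeated inside the subquery, otherwise a file whose
newest version is later than the cut-off date would drop out of the
result entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fileversions (
        filehistory_id TEXT,
        version        INTEGER,
        path           TEXT,
        dbv_date       TEXT   -- date of the owning database version
    );
    INSERT INTO fileversions VALUES
        ('h1', 1, 'a.txt', '2013-12-10 09:00:00'),
        ('h1', 2, 'a.txt', '2013-12-11 09:00:00'),
        ('h1', 3, 'a.txt', '2013-12-13 09:00:00'),
        ('h2', 1, 'b.txt', '2013-12-11 10:00:00');
""")

def state_at(conn, date):
    """Latest file version per history among those recorded before `date`."""
    return conn.execute("""
        SELECT fv.filehistory_id, fv.version, fv.path
        FROM fileversions fv
        WHERE fv.dbv_date < :date
          AND fv.version = (SELECT MAX(fv2.version)
                            FROM fileversions fv2
                            WHERE fv2.filehistory_id = fv.filehistory_id
                              AND fv2.dbv_date < :date)
        ORDER BY fv.filehistory_id
    """, {"date": date}).fetchall()

def versions_strictly_increasing(conn):
    """True if, within each history, versions grow strictly with the date."""
    rows = conn.execute("""
        SELECT filehistory_id, version FROM fileversions
        ORDER BY filehistory_id, dbv_date
    """).fetchall()
    last = {}
    for history_id, version in rows:
        if history_id in last and version <= last[history_id]:
            return False
        last[history_id] = version
    return True

print(state_at(conn, "2013-12-12 18:10:00"))
print(versions_strictly_increasing(conn))
```

If reconciliation can ever record a version number that is not higher
than an earlier one in the same history, the second function returns
False and the max(version) selection is no longer guaranteed to pick
the right row.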

>> I'm not sure how this should be represented in a persistent state 
>> but based on what is done in most of the version control systems I 
>> know, I think we need a CurrentDatabase entity which aggregates
>> one FileVersion (the current one) for each path in the repository.
> 
> Do you want to persist the CurrentDatabase, or just "create" it on 
> the fly?
> 
> If it's the former, I don't know about this. I'm not saying the
> other version control systems are wrong, but having a single 
> CurrentDatabase (representing the last state) is not sufficient, 
> because we need to be able to go back in time for the restore 
> operation (anywhere else?)

I think it might be a good idea to persist the current database and I
don't think it's incompatible with going back in time. Actually it's the
standard practice in delta-encoded version control systems (see for
instance the skip deltas of svn
http://svn.apache.org/repos/asf/subversion/trunk/notes/skip-deltas).
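
As a rough illustration of why the two are compatible, here is a
sketch in Python. The Repository class and its fields are hypothetical
and do not mirror Syncany's or Subversion's code: the current state is
kept as a ready-to-use map, while an append-only history still allows
any past state to be rebuilt for a restore.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileVersion:
    path: str
    version: int
    date: str            # ISO date of the database version introducing it
    deleted: bool = False

class Repository:
    def __init__(self):
        self.history = []   # append-only log of every FileVersion
        self.current = {}   # persisted "current database": path -> FileVersion

    def apply(self, fv):
        """Record a new file version and keep the current view in sync."""
        self.history.append(fv)
        if fv.deleted:
            self.current.pop(fv.path, None)
        else:
            self.current[fv.path] = fv

    def state_at(self, date):
        """Rebuild the path -> FileVersion view as of `date` from the log."""
        view = {}
        for fv in self.history:      # the log is in application order
            if fv.date > date:       # ISO dates compare correctly as strings
                continue
            if fv.deleted:
                view.pop(fv.path, None)
            else:
                view[fv.path] = fv
        return view

repo = Repository()
repo.apply(FileVersion("a.txt", 1, "2013-12-10"))
repo.apply(FileVersion("b.txt", 1, "2013-12-11"))
repo.apply(FileVersion("a.txt", 2, "2013-12-12"))
repo.apply(FileVersion("b.txt", 2, "2013-12-13", deleted=True))

print(sorted(repo.current))                 # paths in the current database
print(sorted(repo.state_at("2013-12-11")))  # paths as of an earlier date
```

The persisted map answers "what is the latest state?" without any
replay, and the history answers "what was the state at date X?" at the
cost of a scan.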

>> From this on disk/in relational db data model, each operation can 
>> derive what it needs in memory, based on some specific DAO if 
>> needed.
>> 
> 
> Can you elaborate on that? Does that mean having a RestoreDAO that 
> implements specific queries (such as the one above)?

I'm far from being the DAO expert ;-) What I had in mind is a specific
in-memory representation of the database for some operations. For
instance, the status operation compares the local file tree to the
database file tree. The best way to do that is to load a map from file
path to FileVersion. During the up operation, the indexer needs to know
the existing chunks. In this case, the best way to do that seems to be
a map from checksums to chunks. Basically, each operation calls for a
specific data structure that will load in memory a view of the database.
Of course, one can issue a select whenever some information needs to be
fetched from the database, but this would mean putting a lot of trust in
the caching capabilities of the database. In the case of status, I'm
almost sure this will not work. For other operations, like chunking, as
we will be loading things from the disk anyway, I'm not sure fully
loading the checksum to chunk map is such a good idea.
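
The two views described above can be sketched as follows (Python, with
hypothetical names and a simplified FileVersion that carries only a
path and a checksum; this is not Syncany's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileVersion:
    path: str
    checksum: str

def load_path_index(file_versions):
    """View for status: map each path to its current FileVersion."""
    return {fv.path: fv for fv in file_versions}

def status(local_files, path_index):
    """Compare the local tree (path -> checksum) with the database view."""
    changes = {}
    for path, checksum in local_files.items():
        known = path_index.get(path)
        if known is None:
            changes[path] = "new"
        elif known.checksum != checksum:
            changes[path] = "changed"
    for path in path_index:
        if path not in local_files:
            changes[path] = "deleted"
    return changes

def new_chunks(chunk_checksums, known_checksums):
    """View for up: keep only chunks the database does not know yet."""
    seen = set()
    result = []
    for checksum in chunk_checksums:
        if checksum not in known_checksums and checksum not in seen:
            seen.add(checksum)
            result.append(checksum)
    return result

index = load_path_index([FileVersion("a.txt", "aa"), FileVersion("b.txt", "bb")])
local = {"a.txt": "aa", "b.txt": "b2", "c.txt": "cc"}
print(status(local, index))
print(new_chunks(["c1", "c9", "c9", "c2"], {"c1", "c2"}))
```

Both lookups are single dictionary or set accesses once the view is
loaded; whether loading the full checksum set up front is worthwhile
for chunking is exactly the open question raised above.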

Is that clearer?

> As a quick recap, here is what I understand is wrong with the current
> data model (or better: its representation in code):

Wrong is strong, I would say improvable, at least theoretically :-D

> - Relationship between DatabaseVersion and lists of 
> ChunkEntry, MultiChunkEntry, etc. should not be explicit

Definitely.

> - No easy view on the current/latest database

That needs to be assessed.

> - Minor: no optimal (not normalized?) representation of the file
> version attributes (I don't see an issue here)

It's simply that some files do not need some of the attributes in
FileVersion, which calls either for NULLs in the database or for
separate entities. Hair splitting.

Cheers,

Fabrice
