
syncany-team team mailing list archive

Re: Bigger database issues

 

Hello again,
>> Something in the sense of
>>
>> select f.* from databaseversions dbv join filehistories fh on ... 
>> join fileversions fv on ... where dbv.date < '2013-12-12 18:10:00' 
>> and fv.version = (select max(version) from fileversions fv2 where 
>> fv.id = fv2.id)
>>
>> In fact, that's what I was trying to do when I started the "Data 
>> model" wiki page, trying to identify which database views we need and
>> derive the SELECT statements ...
> I'm under the impression that there might be subtleties related to the
> way conflicts are handled in the down phase which could prevent this
> type of query from working. More precisely, can you guarantee that version
> numbers are strictly increasing for each file? It seems possible, but I'm
> a bit frightened by the complexity of the reconciliation code and thus
> not entirely sure of its theoretical properties.
Not that I know of. Right now, the only thing being persisted is new
database versions (in full). File versions or file contents are not
added individually to the database. So from my perspective, this (and
other) queries would work. Obviously, you'd have to make sure that you
only select non-DELETED file versions, but that's the only thing
right now that could make the queries a bit more complicated.

The FileVersion.version attribute is strictly increasing each time a
file version is updated. Old versions might be rolled back (deleted!)
and new file versions might be added (with an identical version
attribute) if a database version is lost and replaced. I hope that was
clear ... It's late :-D
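
To make that concrete, a query along these lines should work (just a
sketch; the join columns and the status column name are assumptions
about the schema, not the real thing):

select fv.*
from databaseversions dbv
join filehistories fh on fh.databaseversion_id = dbv.id
join fileversions fv on fv.filehistory_id = fh.id
where dbv.date < '2013-12-12 18:10:00'
  and fv.status <> 'DELETED'
  and fv.version = (select max(fv2.version)
                    from fileversions fv2
                    where fv2.filehistory_id = fv.filehistory_id);
-- (for a true point-in-time view, the date cutoff would also have to
-- apply inside the subquery)

Note that the DELETED filter has to sit outside the max() subquery: if
the latest version of a history is DELETED, the whole file then drops
out of the result, which is what we want.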

>>> I'm not sure how this should be represented in a persistent state 
>>> but based on what is done in most of the version control systems I 
>>> know, I think we need a CurrentDatabase entity which aggregates
>>> one FileVersion (the current one) for each path of the repository.
>> Do you want to persist the CurrentDatabase, or just "create" it on 
>> the fly?
>>
>> If it's the former, I don't know about this. I'm not saying the
>> other version control systems are wrong, but having a single 
>> CurrentDatabase (representing the last state) is not sufficient, 
>> because we need to be able to go back in time for the restore 
>> operation (anywhere else?).
> I think it might be a good idea to persist the current database and I
> don't think it's incompatible with going back in time. Actually it's the
> standard practice in delta encoded version control systems (see for
> instance the skip deltas of svn
> http://svn.apache.org/repos/asf/subversion/trunk/notes/skip-deltas).

This is very interesting. The whole folder is a gold mine:
http://svn.apache.org/repos/asf/subversion/trunk/notes

Skip deltas are necessary for delta-based data storage -- that is, when
the difference between file1 and file2 is recorded instead of storing
the whole file. As the text nicely explains, this avoids having to
walk through 1000 revisions to reconstruct the file. With Syncany, we do
not have this problem on a file basis, because files (more specifically:
file contents) are just a sorted list of pointers to chunks -- and a
file content always contains all of its pointers.

So what I'm trying to say is that our problem does not affect the actual
data, but just the metadata; and I'm not so sure that we can simply
apply the same logic here.
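
To illustrate with an (assumed) schema: reconstructing any file version
is a single ordered lookup of its chunk references, no matter how old
the version is -- there is no delta chain to replay:

-- table/column names invented for illustration
select cc.chunk_checksum
from filecontent_chunks cc
where cc.filecontent_checksum = :content_checksum  -- the FileContent the FileVersion points to
order by cc.num;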

But if we did apply this logic, what do you think it could/would look like?

Brainstorming:
- I imagine we'd have skip-databaseversions, e.g. every 50 database
versions, we'd add all file histories/file versions/etc.
- Or we could have a new filetree SQL table, basically mapping each/some
database versions to a file tree: dbv -> map<filename, fileversion>
(basically a replacement for the caches, but in a SQL database; rough
sketch below)
- Other ideas? Are you sure this is necessary? Or is it only necessary
if we're unable to construct good/fast/easy SQL queries?!
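
A rough sketch of what such a filetree table could look like (all names
invented for illustration):

create table filetree (
  databaseversion_id bigint not null,
  path varchar(1024) not null,
  fileversion_id bigint not null,
  primary key (databaseversion_id, path)
);

-- the tree at any database version is then a single indexed range scan:
select path, fileversion_id
from filetree
where databaseversion_id = :dbv;

The obvious cost is write amplification -- every database version has to
copy forward one row per path -- which is exactly where the skip-delta
idea of only materializing every Nth version would come in.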

> I'm far from being the DAO expert ;-) What I had in mind is a specific
> in-memory representation of the database for some operations. For instance,
> the status operation compares the local file tree to the database file
> tree. The best way to do that is to load a map from file path to
> FileVersion. During the up operation, the indexer needs to know the
> existing chunks. In this case, the best way seems to be a map from
> checksums to chunks. Basically, each operation calls for a specific data
> structure that will load into memory a view of the database. Of course,
> one can issue a select whenever some information needs to be fetched from
> the database, but this would mean putting a lot of trust in the caching
> capabilities of the database. In the case of status, I'm almost sure
> this will not work. For other operations, like chunking, as we will be
> loading things from the disk anyway, I'm not sure fully loading the
> checksum to chunk map is such a good idea.
>
> Is that clearer?

Very! I'll try to talk to Gregor tomorrow and see how far he has gotten
with the whole database stuff.
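
For what it's worth, each of the two maps you describe would boil down
to a single bulk scan at operation start -- roughly like this (table and
column names are, again, just assumptions):

-- status: one scan to build the in-memory path -> FileVersion map
select fv.*
from fileversions fv
where fv.version = (select max(fv2.version)
                    from fileversions fv2
                    where fv2.filehistory_id = fv.filehistory_id);

-- up/indexer: one scan to build the in-memory checksum -> chunk map
select checksum, size from chunks;

Whether fully loading the checksum -> chunk map is acceptable probably
depends on the repository size; for very large repositories, a point
lookup per chunk might indeed be the better trade-off, as you say.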

You've also been playing around with MySQL Workbench. Any new insights? Do
you have an updated version? Have you tried some of the queries we need?
I'm pretty surprised by MySQL Workbench. It's a cool tool, and it
supports importing/exporting a model.

Best,
Philipp

