
syncany-team team mailing list archive

Re: Bigger database issues

 

Hello again.


> My suggestion is not however to change things but to see if they need to
> be changed. I experimented with PlantUML for this and ended up with the
> class diagram attached to this email (with its PlantUML sources). It's
> more detailed than Philipp's one as I tried to include attributes that I
> considered strongly attached to each entity (each entity would be a
> table in a relational db). It is also quite simpler because it removes
> all the caches and redundant parts.
>

That is indeed the minimal (logical) diagram, and much cleaner --
eliminating the explicit associations to DatabaseVersions and missing the
DatabaseVersionHeader and all the caching ...

One minor adjustment: The association between MultiChunkEntry and
ChunkEntry is n:m (that's also wrong in my diagram), because it can
happen that two clients index the same file at the same time, thereby
adding the same chunks to two different multichunks.
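
A minimal sketch of what that n:m association could look like in a
relational schema, using Python's built-in sqlite3 module. The table and
column names (chunk, multichunk, multichunk_chunk, checksum) are
assumptions made for illustration; the thread only names the entities.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the Syncany code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chunk (checksum TEXT PRIMARY KEY, size INTEGER NOT NULL);
    CREATE TABLE multichunk (id TEXT PRIMARY KEY);
    -- Join table: one row per (multichunk, chunk) pair makes the association n:m.
    CREATE TABLE multichunk_chunk (
        multichunk_id  TEXT NOT NULL REFERENCES multichunk(id),
        chunk_checksum TEXT NOT NULL REFERENCES chunk(checksum),
        PRIMARY KEY (multichunk_id, chunk_checksum)
    );
""")

# Two clients index the same file concurrently, so the same chunk
# ends up in two different multichunks.
conn.execute("INSERT INTO chunk VALUES ('c1', 512)")
conn.executemany("INSERT INTO multichunk VALUES (?)", [("mcA",), ("mcB",)])
conn.executemany(
    "INSERT INTO multichunk_chunk VALUES (?, ?)",
    [("mcA", "c1"), ("mcB", "c1")],
)


def multichunks_for_chunk(checksum):
    """Return the ids of all multichunks that contain the given chunk."""
    rows = conn.execute(
        "SELECT multichunk_id FROM multichunk_chunk "
        "WHERE chunk_checksum = ? ORDER BY multichunk_id",
        (checksum,),
    )
    return [row[0] for row in rows]
```

With a 1:n mapping the second insert for chunk 'c1' would have to be
rejected or overwrite the first; the join table keeps both.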

(...) but I like being Captain Obvious.
>

Well, Captain Obvious, ... It's only obvious after someone points
out the elephant in the room.
Related :-) : http://www.youtube.com/watch?v=Ahg6qcgoay4

> (...) and of one major issue: there is no simple way to query
> this data model in order to get the current state of the repository.
> Indeed, you need to reconstruct the winning series of commits (i.e., of
> DatabaseVersion) and to walk said series to determine the current status
> of all files. This is mainly caused by DatabaseVersion being delta and
> not complete commits. In the current code this is handled by a full
> database cache which induces more or less a full duplication of the
> database.
>

As so often, this is EXACTLY what's wrong with the current database
representation -- and one of the reasons why I thought a SQL backend would
be a lot simpler to handle. In my naive mind, I thought we could store the
clean data model (as per your diagram) in the SQL database and query the
current state when needed. Something along the lines of

select fv.*
from databaseversions dbv
join filehistories fh on ...
join fileversions fv on ...
where
   dbv.date < '2013-12-12 18:10:00'
   and fv.version = (select max(version) from fileversions fv2 where fv.id = fv2.id)

In fact, that's what I was trying to do when I started the "Data model" wiki
page, trying to identify which database views we need and derive the SELECT
statements ...
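
A runnable sketch of that kind of query, using Python's built-in sqlite3
module. Everything about the schema is an assumption made for
illustration: the table and column names, the direct link from
fileversion to databaseversion (the filehistories join is folded into a
filehistory_id column), and the sample rows. It also assumes a single
linear series of database versions, so it does not address picking the
winning series of commits, and it does not handle deleted files. Unlike
the query above, the subquery applies the date cutoff too, so a file with
a newer version after the cutoff still shows up with its older version.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE databaseversion (id INTEGER PRIMARY KEY, date TEXT NOT NULL);
    CREATE TABLE fileversion (
        filehistory_id     TEXT NOT NULL,
        version            INTEGER NOT NULL,
        databaseversion_id INTEGER NOT NULL REFERENCES databaseversion(id),
        path               TEXT NOT NULL,
        PRIMARY KEY (filehistory_id, version)
    );
    INSERT INTO databaseversion VALUES
        (1, '2013-12-10 09:00:00'),
        (2, '2013-12-12 17:00:00'),
        (3, '2013-12-13 08:00:00');
    INSERT INTO fileversion VALUES
        ('h1', 1, 1, 'a.txt'),
        ('h1', 2, 2, 'a-renamed.txt'),
        ('h1', 3, 3, 'a-final.txt'),
        ('h2', 1, 2, 'b.txt');
""")

# For each file history, pick the highest version committed before the cutoff.
STATE_AT = """
    SELECT fv.filehistory_id, fv.version, fv.path
    FROM fileversion fv
    JOIN databaseversion dbv ON dbv.id = fv.databaseversion_id
    WHERE dbv.date < :cutoff
      AND fv.version = (
          SELECT MAX(fv2.version)
          FROM fileversion fv2
          JOIN databaseversion dbv2 ON dbv2.id = fv2.databaseversion_id
          WHERE fv2.filehistory_id = fv.filehistory_id
            AND dbv2.date < :cutoff
      )
    ORDER BY fv.filehistory_id
"""


def state_at(cutoff):
    """Return (filehistory_id, version, path) rows as of the given timestamp."""
    return conn.execute(STATE_AT, {"cutoff": cutoff}).fetchall()
```

Calling state_at('2013-12-12 18:10:00') on the sample data returns version
2 of 'h1' and version 1 of 'h2'; version 3 of 'h1' is excluded because its
database version is dated after the cutoff.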


> I'm not sure how this should be represented in a persistent
> state but based on what is done in most of the version control systems I
> know, I think we need a CurrentDatabase entity which aggregates one
> FileVersion (the current one) for each path of repository.
>

Do you want to persist the CurrentDatabase, or just "create" it on the fly?

If it's the former, I don't know about this. I'm not saying the other
version control systems are wrong, but having a single CurrentDatabase
(representing the last state) is not sufficient, because we need to be able
to go back in time for the restore operation (anywhere else?)

> From this on disk/in relational db data model, each operation can derive
> what it needs in memory, based on some specific DAO if needed.
>

Can you elaborate on that? Does that mean having a RestoreDAO that
implements specific queries (such as the one above)?
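
One possible reading of such a RestoreDAO, sketched in Python with the
built-in sqlite3 module. The class name RestoreDao, the method
file_tree_at, the FileVersion dataclass and the flattened fileversion
table (with a committed_at column standing in for the database version
date) are all hypothetical; nothing here is taken from the actual code
base. The point is only that the operation gets an in-memory view
(path -> file version) derived from the persistent model on demand.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVersion:
    history_id: str
    version: int
    path: str


class RestoreDao:
    """Hypothetical DAO wrapping the one query the restore operation needs."""

    def __init__(self, conn):
        self._conn = conn

    def file_tree_at(self, cutoff):
        """Map each path to its latest FileVersion committed before cutoff."""
        rows = self._conn.execute(
            """
            SELECT fv.filehistory_id, fv.version, fv.path
            FROM fileversion fv
            JOIN (
                SELECT filehistory_id, MAX(version) AS version
                FROM fileversion
                WHERE committed_at < ?
                GROUP BY filehistory_id
            ) latest
              ON latest.filehistory_id = fv.filehistory_id
             AND latest.version = fv.version
            """,
            (cutoff,),
        )
        return {path: FileVersion(h, v, path) for h, v, path in rows}


def open_demo_db():
    """Build an in-memory database with illustrative sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fileversion (
            filehistory_id TEXT NOT NULL,
            version        INTEGER NOT NULL,
            path           TEXT NOT NULL,
            committed_at   TEXT NOT NULL,
            PRIMARY KEY (filehistory_id, version)
        );
        INSERT INTO fileversion VALUES
            ('h1', 1, 'a.txt',         '2013-12-10 09:00:00'),
            ('h1', 2, 'a-renamed.txt', '2013-12-12 17:00:00'),
            ('h1', 3, 'a-final.txt',   '2013-12-13 08:00:00'),
            ('h2', 1, 'b.txt',         '2013-12-12 17:00:00');
    """)
    return conn
```

A restore to a given date would then call
RestoreDao(conn).file_tree_at('2013-12-12 18:10:00') and work from the
returned dictionary, without a persisted CurrentDatabase.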

What do you think of all that?
>

Good stuff Fabrice, as always!! This is so incredibly helpful!

As a quick recap, here is what I understand is wrong with the current data
model (or better: its representation in code):
- The relationship between DatabaseVersion and the lists of ChunkEntry,
MultiChunkEntry, etc. should not be explicit
- No easy view on the current/latest database
- Minor: non-optimal (not normalized?) representation of the file version
attributes (I don't see an issue here)
- Anything else?

Best
Philipp
