syncany-team team mailing list archive

Re: Bigger database issues

 

Hi,

On 08/12/2013 00:17, Philipp Heckel wrote:
> Now to the topic: While I am really, really happy that you guys are
> discussing so enthusiastically, I think we're drifting a bit into
> philosophical and academic discussions. Please do not get this the wrong
> way, I think discussion is important, but I think that sometimes code is
> easier to understand  -- especially when it's a relatively small change
> in code (like with the IDs). That's why I suggest to simply play around
> in code and show us what you mean.

I would say that each of us has a way of discussing and thinking about
code which is different and reflects one's background. I'm an academic,
so I'm more at ease discussing things on a theoretical/philosophical
level and then moving to more concrete things. But I will not ask others
to follow me there ;-) So the discussion with Gregor was really nice for
me, but I totally understand that some code is needed at some point, and
I also understand that you, Philipp, and some others will probably wait
until things are a little more concrete to comment. In my opinion,
everybody wins by having this discussion in several steps.

> Also -- and again: do not take this the wrong way! -- there are many
> important things to do to get a working piece of software, and I feel
> that the ID question is more of an optimization. Now I know that Fabrice
> likes to get to 1MM files (and believe me we'll get there!), but we
> first need to be able to perform a cleanup of files and file versions,
> and represent the local database in general in a more efficient way. So
> if you will: there are bigger issues to consider when drafting an ID
> solution, and bigger issues to solve in general :-)

Of course, but keeping a long as the id was a (very small) risk and
moving to a better solution is needed. A simple solution like using
ByteArray is clearly possible, but I think the solution proposed by
Gregor (and me) is not that complicated.

> [..] 
> Next steps:
> - I'm meeting with Gregor tomorrow: My original goal was to talk about
> the database stuff in general, but I guess we'll also talk over the ID
> stuff. Maybe we'll be enlightened then. We'll review all the code and
> suggestions and hopefully implement something. (Btw. I liked the
> ShortId<T> & ArrayId<T> idea)

Ok. I've pushed some additional modifications along the lines of FileId
to my branch (longer-file-id), but they are not based on Gregor's design.

> - It would be very valuable to me if you could review the general
> Database in-memory representation. My solution to the ever-growing local
> RAM was to simply put everything in a local SQL database, and load it on
> demand, but the JPA stuff is complex and maybe it can be done more
> easily ... Ideas?

My personal problem with that is again a theoretical one, or a conceptual
one if you prefer. I think a design document is _absolutely_ needed if
you want to obtain something correct for the database (in memory,
locally on disk and remotely on storage). I mean that you had a first
implementation in the older code base which brought a lot of insights
and allowed us to identify two major problems: version control of the
database itself and communication issues around this version control. In
the second implementation, you have something quite stable that contains
an informal specification of the version control system (based on vector
clocks and such) and of the communication (delta based). But you have
also identified representation issues. My recommendation is to use this
second implementation as the basis of a design document for the
representation, rather than using JPA and hoping for the best (which
won't happen, as all the benchmarks I've seen show JPA performing quite
badly compared to JDBC).

I think we first need an entity-relationship model of the data.
For instance, we have a Chunk entity and a MultiChunk entity, with an "is
made of" relation, etc. It would be way simpler to reason about such a
model than about a bunch of classes.

Then we need to identify scenarios and see what they need in terms of
queries against the model.
For instance, when one wants to upload his/her modifications, a file
tree walk will take place. This needs to browse both the file system and
the entire last known state of the remote storage (a.k.a. the current
version of all files) to compare them. A very bad idea would be to walk
the tree and query the database for each file to get the last known
state of the file: because of the round trips between the walking code
and the database code, this would waste a large amount of time. In
addition, one needs to detect deleted files, which can only be done via
a full scan of the database. So I think in this case, we need to fully
load the current state of the database, which is more or less a Map
from file path (file key in the future, to leverage inodes) to its
metadata. This gives a first constraint on the database representation:
it needs to be able to produce such a "current state snapshot"
efficiently. Should be doable with a "select path,metadata from
somewhere where version=current" (or something similar, you see what I
mean).
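
To make the snapshot idea a bit more concrete, here is a minimal sketch
in Java. All names are made up for illustration and do not come from the
code base, and I assume the metadata can be reduced to a single Long
(say, a last-modified timestamp). Both maps are assumed to be fully
loaded in memory, "known" from one bulk query and "local" from the tree
walk, so no per-file database round trip is needed and deleted files
fall out of the comparison for free.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the Syncany code base.
final class SnapshotDiff {

    /** Paths that were added, modified or deleted since the last known state. */
    static final class Changes {
        final Set<String> added = new TreeSet<>();
        final Set<String> modified = new TreeSet<>();
        final Set<String> deleted = new TreeSet<>();
    }

    /**
     * Compares the last known state with the local file tree.
     * Both maps go from file path to a metadata stand-in (assumed: a Long).
     */
    static Changes diff(Map<String, Long> known, Map<String, Long> local) {
        Changes changes = new Changes();
        // Assume every known path is gone, then remove the ones we still see.
        changes.deleted.addAll(known.keySet());
        for (Map.Entry<String, Long> entry : local.entrySet()) {
            String path = entry.getKey();
            Long previous = known.get(path);
            if (previous == null) {
                changes.added.add(path);
            } else {
                changes.deleted.remove(path);
                if (!previous.equals(entry.getValue())) {
                    changes.modified.add(path);
                }
            }
        }
        return changes;
    }
}
```

The whole comparison is a single pass over the local map plus hash
lookups, which is why loading the snapshot in one query should pay off.
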
Other scenarios will ask for other things.
For instance, in the watcher case, maybe individual queries will make
sense. Also, when one loses a race in the upload, the last commit must
be rolled back (sort of), so one will need a way to identify the last
commit. Another example is the cleanup operation: one needs a way to
identify chunks that are no longer used.
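
For the cleanup scenario, a rough sketch of the kind of query the model
has to answer could look like this in Java. Again, the names are
hypothetical; I assume ids are plain strings, that each retained file
version lists the chunks it is made of, and that the "is made of"
relation between MultiChunk and Chunk is available as a map.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the Syncany code base.
final class ChunkCleanup {

    /** Chunks that no retained file version refers to any more. */
    static Set<String> unusedChunks(Set<String> allChunks,
                                    Collection<List<String>> retainedFileVersions) {
        Set<String> unused = new TreeSet<>(allChunks);
        for (List<String> chunkIds : retainedFileVersions) {
            unused.removeAll(chunkIds);
        }
        return unused;
    }

    /** MultiChunks that are made only of unused chunks and can be dropped. */
    static Set<String> removableMultiChunks(Map<String, Set<String>> multiChunkToChunks,
                                            Set<String> unused) {
        Set<String> removable = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : multiChunkToChunks.entrySet()) {
            if (unused.containsAll(entry.getValue())) {
                removable.add(entry.getKey());
            }
        }
        return removable;
    }
}
```

A MultiChunk that still holds at least one used chunk stays where it
is in this sketch; repacking such MultiChunks is a separate question.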

Without all of this, I think we are stuck with the current very complex
Java representation of the data. This representation was needed to build
a working version of Syncany and I'm truly impressed by the result. Now
that it works, it's time to sort things out, rather than first trying to
optimize the storage of this representation as if it were dictated by
the data.

Cheers,

Fabrice


