
syncany-team team mailing list archive

Re: Bigger database issues

 

Hello again,

I hope y'all had a great weekend.
> I would say that each of us has a way of discussing and thinking about
> code which is different and reflects one's background. I'm an academic,
> so I'm more at ease discussing things on a theoretical/philosophical level
> and then moving to more concrete things. But I will not ask others to
> follow me there ;-) So the discussion with Gregor was really nice for me,
> but I understand totally that some code is needed at some point, and I
> also understand that you, Philipp, and some others will probably wait
> until things are a little more concrete to comment. In my opinion, everybody
> wins by having this discussion in several steps.

That is true. It's probably good to have different views and different
approaches to these kinds of issues. Otherwise I probably wouldn't even
have seen the ID topic as an issue. But we'll all profit from this
discussion. I hope I haven't offended you! Gregor also told me that he
found the e-mail discussions very useful, so please keep discussing :-)

> Of course, but leaving a long id was a (very small) risk and moving to a
> better solution is needed. A simple solution like using ByteArray is
> clearly possible, but I think the solution proposed by Gregor (and me)
> is not that complicated.
> (...)
> Ok. I've pushed some additional modifications in the line of FileId to
> my branch (longer-file-id), but it's not based on Gregor's design.
I definitely see that now, and I think that it's really easy to
implement -- even though it will touch many different files. I just took
your code and merged it into the master, and then started to eliminate
the ByteArray and byte[] based IDs as far as I could. Here's the result:

- https://github.com/binwiederhier/syncany/commit/77d545140376b44e0b69ca5bf5de939c3be9e69f
- https://github.com/binwiederhier/syncany/commit/46a6ae1ba2c3cf3b5cdcb81f445a77eda5455131

I basically did what we already started for the ChunkEntryId/FileId, but
extended the IDs to the other two database objects as well:

- ChunkChecksum (previously ChunkEntryId, identifies a ChunkEntry by the
chunk's checksum, not random!)
- FileChecksum (identifies FileContent by the file's checksum, not random!)
- FileHistoryId (identifies PartialFileHistory, random!)
- MultiChunkId (identifies MultiChunkEntry, random!)

They all inherit from ObjectId (Fabrice's version, not Gregor's), but
they could now easily be refactored to implement the ShortId/ByteArrayId
variant we discussed. I also removed the byte[] representation in the
actual object:

public class ChunkEntry {
    private ChunkChecksum checksum;
    ...
}

There are still a few internal representation leaks:
- All constructors allow passing byte[], e.g. new ChunkChecksum(byte[])
- In a few places, an actual byte[] is still required via the getRaw()
method, although that method is marked deprecated

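For reference, the wrapper idea can be sketched roughly like this. This is a simplified illustration, not the actual Syncany classes; the raw byte[] stays private, and callers compare and print IDs without ever touching the array:

```java
import java.util.Arrays;

// Simplified sketch of an ID wrapper; names mirror the discussion,
// not the real implementation.
abstract class ObjectId {
    private final byte[] bytes;

    protected ObjectId(byte[] bytes) {
        this.bytes = Arrays.copyOf(bytes, bytes.length); // defensive copy
    }

    @Override
    public boolean equals(Object other) {
        // Compare classes too, so a ChunkChecksum never equals a FileChecksum
        if (other == null || getClass() != other.getClass()) {
            return false;
        }
        return Arrays.equals(bytes, ((ObjectId) other).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b)); // hex encoding
        }
        return sb.toString();
    }
}

class ChunkChecksum extends ObjectId {
    ChunkChecksum(byte[] checksum) {
        super(checksum);
    }
}
```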
What do you think? A bit better?

> My personal problem with that is again a theoretical one, or a conceptual
> one if you prefer. I think a design document is _absolutely_ needed if
> you want to obtain something correct for the database (in memory,
> locally on disk and remotely on storage). I mean that you had a first
> implementation in the older code base which brought a lot of insights
> and allowed you to identify two major problems: version control of the
> database itself and communication issues around this version control. In
> the second implementation, you have something quite stable that contains
> an informal specification of the version control system (based on vector
> clocks and such) and of the communication (delta based). But you also
> identified representation issues. My recommendation is to use this
> second implementation as the basis of a design document of the
> representation rather than using JPA and hoping for the best (which
> won't happen, as all the benchmarks I've seen show quite bad
> performance of JPA compared to JDBC).
If you put it that way, it really sounds like trial-and-error, doesn't
it? :-)

You are probably really, really right. So far, I've discussed the
software architecture with Steffen a lot and if we thought something
made sense, we've started to implement it (just because that's more fun
than writing documents). Most of the time, you realize pretty quickly if
something works, or if it doesn't.

But maybe it's better to write/draw something first to find the best
data representation.

> I think we need first an entity-relationship model of the data.
> For instance, we have a Chunk entity and a MultiChunk entity, with a "is
> made of" relation, etc. It would be way simpler to reason on such a
> model than on a bunch of classes.
I'm not much of a formal-model person, but I'd say this is more of a
class diagram than an ER diagram -- although in this case I don't see
much of a difference:
https://raw.github.com/binwiederhier/syncany/46a6ae1ba2c3cf3b5cdcb81f445a77eda5455131/docs/Diagram%20Database.png

Maybe this can be extended (if we need to).
(Side note: do you know any good open source, cross-platform modelling
tools? At work I use Enterprise Architect, but that's far from open source.)

> Then we need to identify scenarios and see what they need in terms of
> request to the model.
Great idea!

I have no idea why we didn't think of that earlier. I sort of always had
the 'scenarios'/'use cases' in my head, but everyone else obviously
doesn't have that context, so this is definitely necessary to define
useful views of/on the model. That's also the reason for the very weird
caching stuff in the database.

I'll try to draw/write down something!
Do you have any suggestions in terms of format? Use cases + activity
diagrams?

> For instance, when one wants to up his/her modifications, a file tree
> walk will take place. This needs to browse both the file system and
> the entire last known state of the remote storage (a.k.a. the current
> version of all files) to compare them. A very bad idea would be to walk
> the tree and query the database for each file to get the last known
> state of the file: because of the round trips between the walking code
> and the database code, this would waste a large amount of time. In
> addition, one needs to detect deleted files, which can only be done via
> a full scan of the database. So I think in this case, we need to fully
> load the current state of the database, which is more or less a Map
> between a file path (file key in the future, to leverage inodes) and its
> metadata. This gives a first constraint to the database representation:
> it needs to be able to produce such a "current state snapshot"
> efficiently. Should be doable with a "select path,metadata from
> somewhere where version=current" (or something similar, you see what I
> mean).
Yep!
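To make that constraint concrete, here's a rough sketch in plain Java of comparing a walked tree against a preloaded snapshot, including the full-scan deletion check. All names are invented for illustration; the snapshot map would come from one bulk query like the "select path,metadata" statement above, rather than one query per file:

```java
import java.util.*;

// Illustrative only: diff a preloaded "current state" snapshot
// (path -> metadata) against what the tree walk found on disk.
class SnapshotCompare {
    static Map<String, List<String>> diff(Map<String, String> snapshot,
                                          Map<String, String> onDisk) {
        List<String> added = new ArrayList<>();
        List<String> changed = new ArrayList<>();
        List<String> deleted = new ArrayList<>();

        // One pass over the walked tree: no per-file database round trips
        for (Map.Entry<String, String> e : onDisk.entrySet()) {
            String known = snapshot.get(e.getKey());
            if (known == null) {
                added.add(e.getKey());
            } else if (!known.equals(e.getValue())) {
                changed.add(e.getKey());
            }
        }

        // Deletions require the full scan of the known state
        for (String path : snapshot.keySet()) {
            if (!onDisk.containsKey(path)) {
                deleted.add(path);
            }
        }

        Map<String, List<String>> result = new HashMap<>();
        result.put("added", added);
        result.put("changed", changed);
        result.put("deleted", deleted);
        return result;
    }
}
```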
> Other scenarios will ask for other things.
> For instance, in the watcher case, maybe individual queries will make
> sense. Also, when one loses a race in the upload, the last commit must
> be rolled back (sort of), so one will need a way to identify the last commit.
> Another example is the cleanup operation: one needs a way to identify
> chunks that are no longer used.
Yep!
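The cleanup scenario boils down to a set difference. Here's a trivial sketch with invented names; in SQL this could be a single anti-join along the lines of "select checksum from chunk where checksum not in (select chunk_checksum from chunk_reference)" (table names made up):

```java
import java.util.*;

// Illustrative only: a chunk is garbage if nothing references it anymore.
class ChunkCleanup {
    static Set<String> findUnusedChunks(Set<String> allChunks,
                                        Set<String> referencedChunks) {
        Set<String> unused = new HashSet<>(allChunks); // copy, don't mutate input
        unused.removeAll(referencedChunks);
        return unused;
    }
}
```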
> Without all of this, I think we are stuck in the current very complex
> Java representation of the data. This representation was needed to build
> a working version of Syncany and I'm truly impressed by the result. Now
> that it works, it's time to sort things out without trying first to
> optimize storing this representation as if it were dictated by the data.

Yes and no. I agree that it's important to get a better idea of which
views on the data model we need (~ which 'select statements' or 'getter
methods' should perform well). However, I do not agree that trying to
persist the existing data model is necessarily a bad thing -- and it's
certainly not just an optimization. Once we have the data model in an
SQL-based database, we can easily create the views we need for the
above-mentioned scenarios.

Next steps:
- I'll try to draw/draft some scenarios and corresponding diagrams; if
you want, feel free to suggest/help!
- I'll also try to do a first prototype of the cleanup operation to
delete old metadata (purpose: (1) reduce memory footprint for now, (2)
insight gathering about cleanup)
- Gregor will play around with the JPA stuff and try to persist/load to
a relational database (purpose: also for prototyping + exploring
performance and compatibility with the data model)
- I'd like to move with all branches to Gradle by the end of the week,
so if you haven't done any testing there, please do :-)
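For the JPA prototyping, a minimal mapping might look something like this. The entity and column names are made up for illustration, not Syncany's actual schema:

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Purely illustrative mapping sketch, not the real design.
@Entity
@Table(name = "chunk")
class ChunkEntryEntity {
    @Id
    @Column(name = "checksum", length = 40)
    private String checksum; // hex-encoded ChunkChecksum

    @Column(name = "size")
    private int size;
}
```

Persisting and loading would then be entityManager.persist(chunk) and entityManager.find(ChunkEntryEntity.class, checksumHex); comparing that against hand-written JDBC for the snapshot query should quickly show whether the performance concern applies here.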

Best,
Philipp

