
Database reconciliation


Hi,

I've been reviewing the reconciliation algorithm and code
(org.syncany.operations.DatabaseReconciliator). The code is quite
complicated, so I've focused on the algorithm. Please note that what
follows is my understanding of the algorithm, not Philipp's description.
I hope I did not make major mistakes.

If you've been following Syncany's history, you know that handling
conflicts, or more generally concurrency, has been a major difficulty.
Philipp redesigned this part during the "reboot". I think the current
design is sound but needs improvement, and I explain why here.

One of the main design assumptions of Syncany is that the storage is
dumb, removing the need for a central server. This means that Syncany is
a distributed system and faces all the corresponding difficulties. For
the file contents themselves, this is not a big deal, because everything
is uploaded packed into multichunks whose names are almost guaranteed to
be unique (while Syncany does not use UUIDs, it is the same idea, see
https://en.wikipedia.org/wiki/UUID). So when clients of a single
repository upload things to it, no collision can happen. In addition, it
is easy to guarantee atomicity of one upload/download, either because
the backend already guarantees it (S3) or because one can use the
classical move trick (upload to file.tmp and then ask the backend to
rename file.tmp to file).
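
As a small illustration, the move trick boils down to the following;
the TransferManager interface here is a simplified stand-in, not the
actual Syncany plugin API:

    import java.io.File;
    import java.io.IOException;

    class AtomicUploadSketch {

        // Simplified stand-in for a backend connector; not the actual
        // Syncany plugin API.
        interface TransferManager {
            void upload(File local, String remoteName) throws IOException;
            void rename(String fromRemote, String toRemote) throws IOException;
        }

        // The classical move trick: the file only becomes visible under
        // its final name once the upload has fully completed, so a
        // partial upload can never be mistaken for a complete one.
        static void atomicUpload(TransferManager tm, File local, String remoteName)
                throws IOException {
            tm.upload(local, remoteName + ".tmp");
            tm.rename(remoteName + ".tmp", remoteName);
        }
    }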

So the only remaining issue is the database: each client has a local
database which describes the full state of the repository (file
metadata, mapping between files and chunks, mapping between chunks and
multichunks, history of files, etc.). Somehow, the clients have to agree
on this database. The purpose of the DatabaseReconciliator is to make
sure they do (do not confuse this with concurrent modifications of the
files, by the way).

The first trick is that each client is given a unique name (again,
UUID-like) and uploads its modifications of the database as deltas named
after this unique name, together with a version number. For instance, if
we have a client A, its first upload will be of the form db-A-1. As with
the multichunks, there is no risk of overwriting.
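
In code, the naming scheme is essentially just this (illustrative
only, not the actual implementation):

    // Build a delta name such as "db-A-1" from the client's unique name
    // and a monotonically increasing version number.
    static String deltaName(String clientName, long version) {
        return String.format("db-%s-%d", clientName, version);
    }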

The second trick is the use of vector clocks
(https://en.wikipedia.org/wiki/Vector_clock). Each database delta
contains a vector clock which maps each client known to the producer of
the delta to the last known version of its database. For instance, if A
and B share a repository created and first populated by A, the first
database is db-A-1 and its vector clock is (A=>1). Assuming B makes the
second upload, it first has to download db-A-1. It then uploads db-B-1
with vector clock (A=>1,B=>1). This vector demonstrates a causal
relation between db-A-1 and db-B-1, in the sense that the latter happens
after the former, because db-B-1 was produced knowing db-A-1.
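
To make the comparison concrete, here is a minimal sketch of vector
clock comparison in Java (Syncany has its own VectorClock class; this
one is only meant to illustrate the semantics):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Set;

    // Minimal sketch; only illustrates the comparison semantics.
    class VectorClock extends HashMap<String, Long> {

        enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

        Order relationTo(VectorClock other) {
            boolean less = false, greater = false;
            Set<String> clients = new HashSet<>(keySet());
            clients.addAll(other.keySet());
            for (String c : clients) {
                long mine   = getOrDefault(c, 0L);
                long theirs = other.getOrDefault(c, 0L);
                if (mine < theirs) less = true;
                if (mine > theirs) greater = true;
            }
            if (less && greater) return Order.CONCURRENT; // no causal relation
            if (less)    return Order.BEFORE; // this happens before other
            if (greater) return Order.AFTER;  // this happens after other
            return Order.EQUAL;
        }
    }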

As long as two database deltas are not uploaded concurrently, everything
is quite simple: each new delta happens after the previous ones, and the
history is in fact linear.

Now, because of the dumb storage, there is no simple way to prevent
concurrent uploads of database deltas without impairing performance
and/or risking starvation (in particular given the merely eventual
consistency of S3). So we can end up with the following situation:

1) A creates the repo and uploads db-A-1 (A=>1)
2) B makes a change and uploads db-B-1 (A=>1,B=>1)
3) A downloads the changes and uploads db-A-2 (A=>2,B=>1)
4) B concurrently uploads some changes as db-B-2 (A=>1,B=>2)

The two vector clocks are not comparable and there is no causal relation
between db-A-2 and db-B-2. In other words, when C tries to clone the
repository, it cannot infer a consensual state for the database. It will
download db-A-1 and apply the changes from db-B-1, but then it is stuck
because of the lack of a causal relation.
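
With the sketch above, the two clocks of the example indeed compare as
concurrent:

    // db-A-2 carries (A=>2,B=>1) while db-B-2 carries (A=>1,B=>2):
    VectorClock a2 = new VectorClock();
    a2.put("A", 2L); a2.put("B", 1L);

    VectorClock b2 = new VectorClock();
    b2.put("A", 1L); b2.put("B", 2L);

    // Neither clock dominates the other component-wise, so the two
    // deltas are concurrent: there is no causal relation to exploit.
    assert a2.relationTo(b2) == VectorClock.Order.CONCURRENT;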

In the current implementation of the reconciliation, C (or any other
client) uses a deterministic tie-breaking algorithm to choose a winner.
The algorithm uses the timestamps of db-A-2 and db-B-2 as recorded in
those deltas (this is the local time of A and B when they created the
delta). If by any chance the timestamps are identical, the repository
ids are used to break the tie. So, for instance, here the official
linearized/trimmed history becomes db-A-1 -> db-B-1 -> db-A-2, and
db-B-2 is discarded.
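
Such a deterministic tie breaker could look roughly like this (the
DeltaHeader type and its fields are assumptions for the illustration,
not Syncany's actual classes):

    import java.util.Comparator;

    // Hypothetical delta header; illustrative only.
    class DeltaHeader {
        long timestamp;   // local time of the producing client
        String clientId;  // unique client identifier
    }

    // A deterministic total order over concurrent deltas: compare by
    // timestamp first, then by identifier. Every client computes the
    // same order, so they all pick the same winner.
    Comparator<DeltaHeader> tieBreaker =
        Comparator.comparingLong((DeltaHeader d) -> d.timestamp)
                  .thenComparing(d -> d.clientId);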

Because the tie-breaking algorithm is deterministic, all clients will
agree on this linearized/trimmed history. Assume for instance that C has
downloaded the repository and uploaded its own changes. We now have a
new delta, db-C-1 (A=>2,B=>1,C=>1). Notice the B=>1, as db-B-2 was
discarded. When A comes back, it will consider the history to be
db-A-1 -> db-B-1 -> db-A-2 -> db-C-1 and will know how to apply db-C-1
to take the modifications into account. When B does the same thing, it
will also agree on this history and will discover that its upload was
discarded. It has to take action to be able to upload it again.

In my opinion, this works but is suboptimal, as it assumes conflicts
even when none are present. Also, most of the literature on distributed
systems seems to agree on the need for a reconciliation to happen in
order to fix the problem. In the above case, it means producing a delta
that has both db-A-2 and db-B-2 in its causal history.

I think it's important to do that, for several reasons:

1) When a client loses, its content modifications are in the storage (it
has uploaded the new multichunks before uploading the database delta),
yet they are not available to the other clients. This is a waste of
resources.
2) The only way for the modifications made by a loser to become
available is for the loser to reconnect to the storage later, even
though the needed content is already in the storage. This prevents fast
conflict resolution and is a typical starvation situation.
3) Because the conflict is resolved later, the history walking has to be
done on each client, leading to some useless complication. Also, the
history remains non-causal and strongly tied to the deterministic
tie-breaking rule, as opposed to being based solely on vector clocks
(which have guaranteed properties).
4) As long as there is a dangling conflict, the history cannot be
compressed. The resolution of the conflict on each client implies
downloading multiple delta files.
5) While I'm confident the algorithm is sound, it is also original (as
far as I know) and I would feel more secure with a more classical solution.

So I would:

- completely remove the --force option that allows uploading something
without first downloading the previous databases. This is a recipe for
failure, and I'm not even sure it cannot break the system beyond repair.
- implement user-guided conflict resolution when the vector clocks
cannot be used to guarantee a linear history (a sketch follows after
this list). For instance, in the case described above, we have C with
db-A-2 and db-B-2 that are not causally related. Knowing that, C will
first apply to db-B-1 (the last version in the linear history) all
updates from db-A-2 and db-B-2 that do not conflict (that is, which
concern different files). I'm confident this will be the most common
case. Then, for all conflicting updates, the user will be notified
(notice that for text files, standard merging algorithms could be
applied automatically, as git does). In the end, C will push a db-C-1
(A=>2,B=>2,C=>1) which will be a sort of merge commit. Once this is
pushed, C can remove db-A-2 and db-B-2 from the storage, as db-C-1
supersedes them. And then it will be allowed to upload its local
changes, for instance. Obviously, this can also be done by either A or
B. In any case, the only operation allowed at that time would be the
merge operation, so if another client (including A and B) tries to
upload something, it will first need to perform the reconciliation
between A and B (no --force!).
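
Here is the sketch announced above: a rough outline of the merge step,
with placeholder Update/ConflictResolver types that assume at most one
update per file path in each delta (the real delta format is richer):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class MergeSketch {

        // Placeholder update type; illustrative only.
        static class Update {
            String path;    // file the update touches
            String change;  // opaque description of the change
        }

        interface ConflictResolver {
            // Asks the user, or merges text files automatically.
            Update resolve(Update fromA, Update fromB);
        }

        // Updates touching different files are applied automatically;
        // updates touching the same file go to the resolver.
        static List<Update> merge(List<Update> fromA, List<Update> fromB,
                                  ConflictResolver resolver) {
            Map<String, Update> pendingB = new HashMap<>();
            for (Update u : fromB) pendingB.put(u.path, u);

            List<Update> merged = new ArrayList<>();
            for (Update a : fromA) {
                Update b = pendingB.remove(a.path);
                merged.add(b == null ? a : resolver.resolve(a, b));
            }
            merged.addAll(pendingB.values()); // updates only in B
            return merged;
        }
    }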

While we cannot guarantee progress in this case, the need for human
intervention in case of a real conflict will probably limit any risk of
an endless loop. However, as pointed out in the Harmony work of Pierce
et al. (http://www.cis.upenn.edu/~bcpierce/papers/nway-tr.pdf), two
identical conflict resolutions done by different clients will be
difficult to detect. So in this case we will need to implement some form
of waiting/locking to ensure someone actually wins the race to resolve
the conflict. Maybe the client writes to a .resolution file with its
name in it. Someone is guaranteed to win the race to this file, as
uploads are atomic. When a client has written to it, it starts to upload
its solution. At the end of the upload, it verifies that it actually won
the race (in order to accommodate eventual consistency). If this is the
case, it commits (i.e. moves, in the case of an ftp-like remote) its
solution. If not, it removes it and waits for the winner to complete the
reconciliation process.
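
In code, the race could look roughly as follows, reusing the
hypothetical TransferManager from the earlier sketch and assuming it
also offers read() and delete(); the .resolution convention is the
proposal above, not existing Syncany behaviour:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    class ResolutionRace {

        // Earlier hypothetical interface, extended with read/delete.
        interface TransferManager {
            void upload(File local, String remoteName) throws IOException;
            void rename(String fromRemote, String toRemote) throws IOException;
            String read(String remoteName) throws IOException;
            void delete(String remoteName) throws IOException;
        }

        static boolean tryResolveConflict(TransferManager tm, String myClientId,
                                          File mergedDelta, String deltaName)
                throws IOException {
            // 1) Claim the resolution: atomically upload a marker file
            //    containing our client id.
            tm.upload(fileContaining(myClientId), ".resolution");

            // 2) Upload the merged delta under a temporary name.
            tm.upload(mergedDelta, deltaName + ".tmp");

            // 3) Read the marker back to see who actually won the race;
            //    re-checking accommodates eventually consistent backends.
            String winner = tm.read(".resolution");
            if (myClientId.equals(winner)) {
                tm.rename(deltaName + ".tmp", deltaName); // commit the merge
                return true;
            }
            tm.delete(deltaName + ".tmp"); // lost: discard, wait for winner
            return false;
        }

        // Helper: materialize a small string as a local temporary file.
        static File fileContaining(String content) throws IOException {
            File f = File.createTempFile("resolution", null);
            Files.write(f.toPath(), content.getBytes());
            return f;
        }
    }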

Comments?

Best regards,

Fabrice


