syncany-team team mailing list archive
Message #00645
Re: Illegal file names
Hello again,
> You have at least to record the encoding used by the creator of the
> repository.
Can you elaborate on why recording the encoding (locale, LC_CTYPE, I
assume) of the creator of a database version (not repository!) would
help? Because: Java internally handles file names as UTF8 (correct?) and
the database files are encoded as UTF8 ("-Dfile.encoding=UTF8" is set).
And when a file on another user's PC has to be created, the only issue
is how to encode the filename in the local encoding, e.g. to maybe
translate it to ISO-8859-1 or something else, right?!
I am not an expert on encoding stuff, so excuse my stupid questions :-)
I tried to understand how filenames are encoded; here's what I believe
I have found out:
- The standard file systems of all major OSs support Unicode filenames
- There are different illegal characters (e.g. "/", "\", ":" on Windows)
and filenames (e.g. "COM" or "" on Windows)
- Some file systems are case-sensitive, others aren't
- The OS's locale influences the encoding with which file names are
stored (for all file systems?)
e.g. LC_CTYPE="en_US.iso88591" java org.syncany.Syncany will encode
file names differently
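The Windows restrictions from the list above can be sketched as a small check. This is only an illustration: the class and method names are made up for this example, and the rule set (reserved characters, control characters, reserved device names such as CON, NUL, COM1-COM9, LPT1-LPT9, no trailing space or period) is the commonly documented Win32 one, not something taken from Syncany.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Syncany.
class WindowsNameCheck {
    // Characters that Win32 rejects in a file name, plus ASCII control characters.
    private static final Pattern ILLEGAL_CHARS =
            Pattern.compile("[\\\\/:*?\"<>|\\x00-\\x1F]");

    // Reserved device names, with or without an extension, in any case.
    private static final Pattern RESERVED = Pattern.compile(
            "^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)?$",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the single path component looks legal on Windows. */
    static boolean isLegalOnWindows(String name) {
        if (name.isEmpty() || name.endsWith(" ") || name.endsWith(".")) {
            return false;
        }
        return !ILLEGAL_CHARS.matcher(name).find()
                && !RESERVED.matcher(name).matches();
    }
}
```

The check works on a single file name, not on a full path, since "/" and "\" are treated as illegal.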
I played around a little more here: http://pastebin.com/JEZdXtmT
Results here were:
- file.encoding has nothing to do with filenames, only with file
contents (Input/OutputStreams, I think)
- LC_CTYPE/LC_ALL are the relevant environment variables when it comes
to filename encoding.
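To make the difference between the two settings visible, a minimal sketch like the following could be used. It prints file.encoding next to sun.jnu.encoding (an undocumented, implementation-specific property that OpenJDK-based JVMs derive from the locale, so treat it as an assumption) and checks whether a name is representable in a given charset. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Syncany.
class FilenameEncodingCheck {
    /** True if every character of the name can be encoded in the charset. */
    static boolean representable(String name, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        return encoder.canEncode(name);
    }

    public static void main(String[] args) {
        // file.encoding affects file contents; sun.jnu.encoding (if present)
        // reflects the locale-derived encoding used for file names.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));

        String name = "A \u260E telephone.jpg";
        System.out.println("ISO-8859-1 ok? " + representable(name, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 ok?      " + representable(name, StandardCharsets.UTF_8));
    }
}
```

Running it once with LC_CTYPE set to a UTF-8 locale and once with an ISO-8859-1 locale should show which property follows the locale on a given JVM.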
>> - Unclear how this behaves when "FILE" and "file" are created on two
>> different machines and then sync'd...
> Based on the winning strategy implemented in the Reconciliator, I guess
> one version will win and prevent the other one from being ever synced.
True :-)
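The "FILE" vs. "file" situation could at least be detected before files are written. The sketch below groups names by a lower-cased key; this is an approximation, since real case-insensitive file systems use their own case-folding tables, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Syncany.
class CaseCollisions {
    /** Maps each lower-cased name to the original names that collide on it. */
    static Map<String, List<String>> findCollisions(Collection<String> names) {
        Map<String, List<String>> byFolded = new TreeMap<>();
        for (String name : names) {
            // Locale.ROOT avoids locale-specific surprises (e.g. Turkish dotless i).
            String key = name.toLowerCase(Locale.ROOT);
            byFolded.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Keep only keys that are shared by at least two distinct spellings.
        byFolded.values().removeIf(group -> group.size() < 2);
        return byFolded;
    }
}
```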
> Each local database should include local only information that are not
> synchronized. This will be a major win to handle fileKey (aka inode)
> for modification tracking. Once you have the infrastructure to do
> that, you can do fancy tricks such as: - saying you don't want to care
> about a file locally or remotely (mark this file as unsynchronized
> from the remote or to the remote) - mapping a file name to something
> that can be handled locally (such as slugifying a unix filename into a
> windows one) - implementing whatever per file transformation you want
> via plugins.
Dreaming of fancy stuff again, I see :-)
But, yes, that'd be awesome.
> Or slugify.
That's generally the best solution. Although it's harder than it sounds,
because it has to be target-platform-specific AND target encoding-specific.
Example: A "black\white" ☎ telephone.jpg // There is a telephone symbol
in the filename (U+260E)
- On Windows, UTF-16 -> A black-white ☎ telephone (filename
conflict).jpg // Still a telephone symbol in the filename
- On Windows, CP-1252/Windows-1252 -> A black-white telephone (filename
conflict).jpg
- On Linux, UTF8 -> A "black\white" ☎ telephone.jpg // No
conflict, still a telephone symbol in the filename
- On Linux, ISO-8859-1 -> A "black\white" telephone (filename conflict).jpg
- ...
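The cases above can be reproduced with one function that takes the target platform and the target charset as parameters. This is a rough sketch under several assumptions: the class and method names are invented, slashes are mapped to "-" and other illegal Windows characters are dropped (to match the example), runs of spaces left behind are collapsed, and the " (filename conflict)" suffix is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of Syncany.
class TargetFilenameCleaner {
    // Characters Windows does not allow in a file name.
    private static final String WINDOWS_ILLEGAL = "\\/:*?\"<>|";

    static String clean(String name, boolean windows, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        name.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (windows && WINDOWS_ILLEGAL.contains(ch)) {
                // Assumed rule: slashes become '-', other illegal chars vanish.
                if (cp == '\\' || cp == '/') {
                    out.append('-');
                }
            } else if (encoder.canEncode(ch)) {
                // Keep only what the target encoding can represent.
                out.append(ch);
            }
        });
        // Collapse double spaces left where a character was removed.
        return out.toString().replaceAll(" {2,}", " ");
    }
}
```

For instance, cleaning the example name for Windows with StandardCharsets.ISO_8859_1 as the target drops both the quotes and the telephone symbol, while cleaning it for Linux with UTF-8 leaves it untouched.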
Instead of trying to implement endless cases, my strategy would be to
try out different filenames:
if (isWindows) {
    new file := eliminate illegal windows chars(old file)
    new file := eliminate non-target encoding chars(new file) // based on Locale.getDefault() ?!?

    try:
        move(old file, new file)
    exception:
        new file := eliminate all non-ASCII chars(new file)
        move(old file, new file)
}
...
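In Java, the try-and-fall-back idea might look roughly like the sketch below. It generalizes the pseudocode slightly: the caller passes candidate names in order of preference (for example the cleaned name first, then its ASCII-only version), and the first one the file system accepts wins. The class and method names are hypothetical; only the java.nio.file calls are real API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Syncany.
class FallbackRename {
    /** Tries each candidate name in order; returns the path that worked. */
    static Path moveToFirstValid(Path source, List<String> candidates)
            throws IOException {
        Exception lastFailure = null;
        for (String candidate : candidates) {
            try {
                // resolveSibling rejects names the local file system cannot
                // represent; Files.move fails if the target cannot be created.
                Path target = source.resolveSibling(candidate);
                return Files.move(source, target);
            } catch (InvalidPathException | IOException e) {
                lastFailure = e; // remember the failure and try the next name
            }
        }
        throw new IOException("No candidate name could be used for " + source,
                lastFailure);
    }

    /** Last-resort fallback: keep printable ASCII only. */
    static String asciiOnly(String name) {
        return name.replaceAll("[^\\x20-\\x7E]", "");
    }
}
```

Note that an existing target also counts as a failure here (Files.move throws without REPLACE_EXISTING), so a conflicting file is never overwritten silently.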
Thoughts?
Best
Philipp