syncany-team team mailing list archive

Re: Illegal file names

 

Hello again,
> You have at least to record the encoding used by the creator of the
> repository. 

Can you elaborate on why recording the encoding (locale, LC_CTYPE, I
assume) of the creator of a database version (not repository!) would
help? Because: Java internally handles file names as UTF8 (correct?) and
the database files are encoded as UTF8 ("-Dfile.encoding=UTF8" is set).
And when a file on another user's PC has to be created, the only issue
is how to encode the filename in the local encoding, e.g. to maybe
translate it to ISO-8859-1 or something else, right?!

I am not an expert on encoding stuff, so excuse my stupid questions :-)

I tried to understand how filenames are encoded, here's what I believe
to have found out:
- The standard file systems of all major OSs support Unicode filenames
- There are different illegal characters (e.g. "/", "\", ":" on Windows)
and filenames (e.g. "COM" or "" on Windows)
- Some file systems are case-sensitive, others aren't
- The OS's locale influences which encoding file names are stored with
(for all file systems?)
  e.g. LC_CTYPE="en_US.iso88591" java org.syncany.Syncany will encode
file names differently

I played around a little more here: http://pastebin.com/JEZdXtmT

Results here were:
- file.encoding has nothing to do with filenames, only with file
contents (Input/OutputStreams, I think)
- LC_CTYPE/LC_ALL are the relevant environment variables when it comes
to filename encoding.
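
To make that second point concrete, here is a minimal sketch of what "a different filename encoding" means at the byte level. It is not taken from the pastebin above; the class name FilenameBytes and the sample name are made up for illustration. It only shows that the same Java String becomes different byte sequences depending on the charset, which is roughly the difference the locale makes on a file system that stores names as plain bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FilenameBytes {
    // Hex dump of a file name encoded with the given charset
    static String hex(String name, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(cs)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt"; // hypothetical sample name with one non-ASCII char
        // The accented character takes two bytes in UTF-8 ...
        System.out.println("UTF-8:      " + hex(name, StandardCharsets.UTF_8));
        // ... and a single byte in ISO-8859-1
        System.out.println("ISO-8859-1: " + hex(name, StandardCharsets.ISO_8859_1));
    }
}
```

Which charset the JVM actually picks for file names is decided from the environment at startup, so this sketch passes the charset in explicitly instead of guessing it.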

>> - Unclear how this behaves when "FILE" and "file" are created on two
>> different machines and then sync'd...
> Based on the winning strategy implemented in the Reconciliator, I guess
> one version will win and prevent the other one from being ever synced.

True :-)

> Each local database should include local only information that are not
> synchronized. This will be a major win to handle fileKey (aka inode)
> for modification tracking. Once you have the infrastructure to do
> that, you can do fancy tricks such as: - saying you don't want to care
> about a file locally or remotely (mark this file as unsynchronized
> from the remote or to the remote) - mapping a file name to something
> that can be handled locally (such as slugifying a unix filename into a
> windows one) - implementing whatever per file transformation you want
> via plugins.
Dreaming of fancy stuff again, I see :-)
But, yes, that'd be awesome.

> Or slugify.
That's generally the best solution. Although it's harder than it sounds,
because it has to be target-platform-specific AND target-encoding-specific.

Example: A "black\white" ☎ telephone.jpg   // There is a telephone symbol
in the filename (U+260E)
- On Windows, UTF-16 -> A black-white ☎ telephone (filename
conflict).jpg     // Still a telephone symbol in the filename
- On Windows, CP-1252/Windows-1252 -> A black-white telephone (filename
conflict).jpg
- On Linux, UTF8  -> A "black\white" ☎ telephone.jpg     // No
conflict,  still a telephone symbol in the filename
- On Linux, ISO-8859-1 -> A "black\white" telephone (filename conflict).jpg
- ...
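
The encoding half of these cases could be sketched like this. It is only an illustration under my own assumptions: the class EncodableFilter and the method keepEncodable are hypothetical names, and the target charset is passed in by the caller rather than detected. It drops every character the target charset cannot represent and does nothing else (no "(filename conflict)" suffix, no whitespace cleanup).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncodableFilter {
    // Keep only the code points that the target charset can encode
    static String keepEncodable(String name, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                sb.append(s);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "A \u260e telephone.jpg"; // contains the telephone symbol
        // UTF-8 can represent the symbol, ISO-8859-1 cannot
        System.out.println(keepEncodable(name, StandardCharsets.UTF_8));
        System.out.println(keepEncodable(name, StandardCharsets.ISO_8859_1));
    }
}
```

Note that dropping the symbol leaves two consecutive spaces behind, so a real implementation would probably want to normalize whitespace afterwards.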

Instead of trying to implement endless cases, my strategy would be to
try out different filenames:

if (isWindows) {
  new file := eliminate illegal windows chars(old file)
  new file := eliminate non-target encoding chars(new file)    // based on Locale.getDefault() ?!?
  try:
    move(old file, new file)
  exception:
    new file := eliminate all non-ASCII chars(new file)
    move(old file, new file)
}
...
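
In Java, the try-then-fall-back part of the pseudocode could look roughly like the sketch below. Everything named here (SafeRename, replaceWindowsIllegal, asciiOnly, moveWithFallback) is hypothetical and not existing Syncany code. The exact replacement rules are my assumption: quotes and wildcard-like characters are removed, path-like characters become "-". The sketch skips the isWindows check and the target-encoding step, and simply relies on the move failing if the name cannot be used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

public class SafeRename {
    // Assumed rule: drop " * ? < > | and turn \ / : into "-"
    static String replaceWindowsIllegal(String name) {
        return name.replaceAll("[\"*?<>|]", "")
                   .replaceAll("[\\\\/:]", "-");
    }

    // Last resort: keep printable ASCII only
    static String asciiOnly(String name) {
        return name.replaceAll("[^\\x20-\\x7E]", "");
    }

    // Try the cleaned name first; if the platform rejects it, retry with ASCII only
    static Path moveWithFallback(Path oldFile, String wantedName) throws IOException {
        String cleaned = replaceWindowsIllegal(wantedName);
        try {
            return Files.move(oldFile, oldFile.resolveSibling(cleaned));
        } catch (InvalidPathException | IOException e) {
            // Name not usable on this file system or in this encoding
            return Files.move(oldFile, oldFile.resolveSibling(asciiOnly(cleaned)));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("syncany-sketch", ".tmp");
        Path moved = moveWithFallback(tmp, "A \"black\\white\" \u260e telephone.jpg");
        System.out.println("Stored as: " + moved.getFileName());
        Files.delete(moved);
    }
}
```

The upside of this approach is that the code does not need to know the rules of every platform and encoding in advance; the downside is that the resulting name depends on where the file lands, so the mapping back to the original name has to be recorded somewhere.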
 
Thoughts?

Best
Philipp

