kicad-developers team mailing list archive
Message #28243
Re: filename fun

Oh, the unholy mess of extended charsets :D

Not directly related to the problem at hand (but it may contain
suggestions): here is what I have learnt during my endless wrestling
with foreign charsets. Mostly a rant.

I once had to localize an application in Japanese. Using ISO 2022. On
an embedded system with 7 bit fonts. Hint: it wasn't fun.

I believe that 90% of the issues with character sets stem from a
'wrong' separation between the internal and the external code
representation. By the programmer or by the developer of sucky
libraries.

Even wx itself can be built differently, with strings using UTF-8 or
wchar_t (IIRC). OF COURSE the results aren't compatible, not even at
the source level (that was one of my first issues while compiling
kicad, I didn't have the 'right' wx library).

First of all, the language view of the character. In C (C++ too, they
only refined the locale subsystem a little) we have char and wchar_t.
Remember that for the language char is an integer type, and plain char
is neither signed char nor unsigned char: whether it behaves as signed
is left to the implementation (fun part of the standard, that!)
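
To make the signedness point concrete, here is a minimal sketch (the
names plain_char_is_signed and byte_value are mine, just for
illustration):

```cpp
#include <limits>
#include <type_traits>

// char, signed char and unsigned char are three distinct types in C++.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");

// Implementation-defined: typically true with gcc on x86, false on ARM
// Linux or when building with -funsigned-char.
constexpr bool plain_char_is_signed = std::numeric_limits<char>::is_signed;

// Hypothetical helper: look at a byte of text through unsigned char, so
// that bytes >= 0x80 never turn into negative ints.
constexpr int byte_value(char c) { return static_cast<unsigned char>(c); }
```

Going through unsigned char before doing table lookups or comparisons
on text bytes gives the same result whichever way the compiler decided.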

For every useful purpose char is only for ISO 646 i.e. ASCII
(*technically* not even latin-1 or cp1252, but that's a 'compiler
specific thing') and wchar_t for everything bigger. AFAIK current gcc
on Linux has a 4 byte wchar_t (in UCS-4) while msvc uses a 2 byte
wchar_t (UCS-2 originally; Windows has treated it as UTF-16 since
Windows 2000, so anything outside the BMP takes a surrogate pair).
Last time I checked, many years ago, it only supported the BMP, which
is fine in 99.999% of the applications.
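
If you want to know what your own compiler does, something along these
lines should tell you (a sketch; the constant names are made up):

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t on the compiler at hand: 32 bits with gcc on Linux,
// 16 bits with MSVC and MinGW on Windows.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// U+1F600 lies outside the BMP. With a 32-bit wchar_t the literal holds
// one element; with a 16-bit wchar_t the compiler emits a surrogate pair.
constexpr std::size_t units_for_non_bmp =
    sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;  // minus the terminating NUL
```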

Important thing: one char -> one ASCII character, one wchar_t -> one
unicode character (within the BMP, if wchar_t is 2 bytes). The ANSI/ISO
string functions know nothing about multibyte encodings: strlen/wcslen
and similar count elements, so they only give the number of characters
when one element is one character.
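
A small illustration of what the length functions really count,
assuming the narrow string holds UTF-8 (the function names are only
for the example):

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// The same four-character word, once as UTF-8 bytes and once as wchar_t.
const char    narrow_word[] = "caf\xC3\xA9";  // e-acute takes two bytes in UTF-8
const wchar_t wide_word[]   = L"caf\u00e9";   // e-acute is a single wchar_t

// strlen counts char elements, i.e. bytes.
std::size_t narrow_len() { return std::strlen(narrow_word); }

// wcslen counts wchar_t elements.
std::size_t wide_len() { return std::wcslen(wide_word); }
```

Here strlen reports 5 and wcslen reports 4 for what a human reads as a
four-letter word.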

In C++ the strings should be smart enough to 'auto specialize'
themselves to the passed type (std::string and std::wstring are
typedefs for some std::basic_string<> template trickery which I refuse
to understand).
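
For the curious, the trickery boils down to this (count_of is a
hypothetical helper, only there to show one template serving both
widths):

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Both string types are the same template with a different element type.
static_assert(std::is_same<std::string, std::basic_string<char>>::value, "");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value, "");

// Counts how many elements of s are equal to c, for any element type.
template <typename CharT>
std::size_t count_of(const std::basic_string<CharT>& s, CharT c) {
    std::size_t n = 0;
    for (CharT x : s)
        if (x == c) ++n;
    return n;
}
```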

Then the external representation: Linux uses UTF-8, OS X uses UTF-8
(with a different canonicalization rule), Windows (last time I checked)
UTF-16. Strictly speaking UTF-8 should be stored as uint8_t sequences
but since it's so clever it's mostly compatible with ISO 646, so it can
be passed directly to syscalls.

In fact the unix kernel doesn't care about the encoding of the
filenames most of the time (on a unix FS); the only important
characters are NUL (the terminator) and / (the path separator). You can
make a file named with BELs and ls should happily beep when printing
its name (unless it's a smart ls like the current ones, which escape
unprintables).
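
Put as code, the kernel's idea of a legal name component is roughly the
following (a sketch; is_valid_unix_component is my own name, and it
ignores length limits such as NAME_MAX):

```cpp
#include <string>

// A single path component is acceptable to a typical unix filesystem if
// it is non-empty and contains neither NUL nor '/'. Note that there is
// no encoding check whatsoever: any other byte goes.
bool is_valid_unix_component(const std::string& name) {
    if (name.empty()) return false;
    for (char c : name)
        if (c == '\0' || c == '/') return false;
    return true;
}
```

So a name made of BELs passes, and so does a byte sequence that is not
valid UTF-8 at all.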

On Windows IIRC there are two families of functions, the famous A and
W, which get selected depending on whether UNICODE is defined. Also
there is WCHAR, which may or may not be a wchar_t (in Microsoft's
defense, wchar_t wasn't invented yet at the time).

The 'correct' way to handle it would be: process *everything* in core
as wchar_t; but then there is NO WAY to portably pass them to the OS in
both SUS and Windows!

The unix way would be to convert from wchar_t, using the locale
functions (yes, NO ONE said that filenames are UTF-8, you could encode
them as ISO 2022 if the locale was set in such a way), to get a uint8_t
representation and then pass it to open(2) to get a handle.
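
A minimal sketch of that conversion step, using wcstombs from the
standard library. The helper name is hypothetical, and calling wcstombs
with a null destination to get the required size is the POSIX
behaviour (glibc and the Microsoft runtime both do it), so treat that
as an assumption:

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Converts a wide string to the multibyte encoding of the current
// LC_CTYPE locale. Returns false if some character cannot be
// represented in that encoding.
bool wide_to_locale_bytes(const std::wstring& in, std::string& out) {
    // First pass: ask how many bytes the result needs.
    std::size_t needed = std::wcstombs(nullptr, in.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) return false;
    out.resize(needed);
    // Second pass: do the actual conversion into the string's buffer.
    if (needed > 0) std::wcstombs(&out[0], in.c_str(), needed);
    return true;
}
```

The program has to call std::setlocale(LC_ALL, "") once at startup for
the user's locale to be picked up; otherwise the "C" locale is in
effect and anything beyond ASCII is likely to be refused. The resulting
bytes are what would go to open(2).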

The windows way would be to convert from wchar_t to WCHAR with some
SDK function (maybe they are the same, after all, but under gcc I
don't think so) and then pass it to the W flavour of the open call
(CreateFileW, or _wopen in the C runtime) to get a handle.
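
When the two types really differ (a 4 byte wchar_t on one side, 16 bit
units on the other) the conversion is plain UTF-16 encoding. A sketch
under that assumption, with a made-up function name and no validation
of the input:

```cpp
#include <string>

// Re-encodes a wide string as UTF-16 code units. Where wchar_t is
// already 16 bits wide the units are copied as they are; where it is
// 32 bits wide, code points above U+FFFF become surrogate pairs.
// Invalid input (lone surrogates, values above U+10FFFF) is not checked.
std::u16string wide_to_utf16(const std::wstring& in) {
    std::u16string out;
    for (wchar_t wc : in) {
        char32_t cp = static_cast<char32_t>(wc);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
        }
    }
    return out;
}
```

On Windows the result is what a W function expects; the call itself is
left out here so that the snippet builds anywhere.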

Inside the file it's no man's land, but kicad standardized on UTF-8
without a BOM, so that would be it.

IN PRACTICE almost nobody uses wchar_t and unix standardized on UTF-8
on the inside (at least gtk, explicitly) because wchar_t is huge and
most of the libraries around take char*.

Hell ensues since developers forget that these char* are not 'simple'
characters and you need to special case every single system call;
there's no way around it.

Also on I/O the reports for kicad don't line up correctly, because of
the same nature of UTF-8 (one byte != one output cell; in fact there
are characters which take TWO output cells!)
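
A rough sketch of padding a report column by code points instead of
bytes. This is not kicad code, the names are invented, it assumes valid
UTF-8, and it still gets double-width and combining characters wrong
(for those you would need something like POSIX wcwidth):

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points: every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new one.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s)
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    return n;
}

// Pads with spaces on the right until the text shows `width` code
// points. Text that is already wider is returned unchanged.
std::string pad_right(const std::string& utf8, std::size_t width) {
    std::string out = utf8;
    std::size_t shown = utf8_code_points(utf8);
    if (shown < width) out.append(width - shown, ' ');
    return out;
}
```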

WX is guilty starting from the build option itself. Opencascade has the
problem already noticed (it fails to special case for MINGW). *Every*
library which takes a filename (libc is not exempt, in fact on windows
there is tchar.h!) needs to take the encoding into account; it's simply
not correct to pass it to fopen or something.

For correctly treating unicode you need huge libraries: as hinted
before, a valid UTF-8 filename under Linux is not so under OS X since
the canonicalization is different: Linux wants precomposed characters,
OS X is designed around decomposed characters... you could end up with
two files with the SAME NAME but different encoding and who knows how
many more subtle issues...
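
To see the 'same name, different encoding' problem in bytes, here is
one name written both ways (a sketch; the file name is invented and no
normalization library is involved):

```cpp
#include <string>

// e-acute precomposed: U+00E9, encoded in UTF-8 as C3 A9.
const std::string precomposed_name = "caf\xC3\xA9.txt";

// e-acute decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT,
// encoded in UTF-8 as 65 CC 81.
const std::string decomposed_name = "cafe\xCC\x81.txt";

// A byte-wise comparison, which is all that strcmp or std::string do,
// sees two different names even though both display the same.
bool looks_same_to_the_computer() {
    return precomposed_name == decomposed_name;
}
```

Comparing such names properly means normalizing both sides first, which
is exactly where the huge libraries come in.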

My personal practice for handling filenames is to relegate the opening
to one place and then pass around the handles (or what? FILE*?
fstream&? tough decision). Also notice that if a function, in the end,
uses the A flavour of the open call under windows (CreateFileA, or
plain fopen) there is NO WAY it could handle an extended name.
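
What I mean by 'one place' could look roughly like this. It is only a
sketch: open_utf8_path is a hypothetical name, it assumes the rest of
the program keeps paths as UTF-8 in std::string, and I picked FILE* as
the handle just to have something to return:

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>
#endif

// The single place where a UTF-8 path becomes an open file.
// Returns nullptr on failure, like fopen.
std::FILE* open_utf8_path(const std::string& utf8_path, const char* mode) {
#ifdef _WIN32
    // Windows: widen to UTF-16 and use the wide variant of fopen.
    auto widen = [](const std::string& s) -> std::wstring {
        if (s.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                    static_cast<int>(s.size()), nullptr, 0);
        std::wstring w(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(),
                            static_cast<int>(s.size()), &w[0], n);
        return w;
    };
    return _wfopen(widen(utf8_path).c_str(), widen(mode).c_str());
#else
    // Unix: the kernel takes the bytes as they are.
    return std::fopen(utf8_path.c_str(), mode);
#endif
}
```

Everything else in the program then takes the FILE* (or whatever handle
you settle on) and never sees a filename again.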
Good luck with that
Have fun