kicad-developers team mailing list archive

Re: filename fun

To: <cirilo.bernardo@xxxxxxxxx>, KiCad Developers <kicad-developers@xxxxxxxxxxxxxxxxxxx>
From: "lomarcan@xxxxxx" <lomarcan@xxxxxx>
Date: Sun, 26 Feb 2017 10:19:21 +0100 (CET)
Reply-to: "lomarcan@xxxxxx" <lomarcan@xxxxxx>
Oh, the unholy mess of extended charsets :D

Not directly related to the problem at hand (but it may contain
suggestions): here is what I have learnt during my endless wrestling
with foreign charsets. Mostly a rant.

Had to localize an application in Japanese. Using ISO 2022. On an
embedded system with 7-bit fonts. Hint: it wasn't fun.

I believe that 90% of the issues with character sets stem from a
'wrong' separation between internal and external code representation,
made by the programmer or by the developer of sucky libraries.
Even wx itself can be built differently, with strings using UTF-8 or
wchar_t (IIRC). OF COURSE the results aren't compatible, not even at
the source level (that was one of my first issues while compiling
KiCad: I didn't have the 'right' wx library).


First of all, the language view of the character. In C (C++ too, they
only refined the locale subsystem a little) we have char and wchar_t.
Remember that for the language char is an integer type, and whether
plain char is signed or unsigned is left to the implementation (fun
part of the standard, that!)

For every useful purpose char is only for ISO 646, i.e. ASCII
(*technically* not even Latin-1 or cp1252, but that's a 'compiler
specific thing') and wchar_t is for everything bigger. AFAIK current
gcc has a 4 byte wchar_t (in UCS-4) but *maybe* MSVC uses a 2 byte
wchar_t (in UCS-2, I'm not sure about that). Maybe these days Windows
has full Unicode but last time I checked, many years ago, it only
supported the BMP, which is fine in 99.999% of applications.
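Since both points are 'compiler specific', a quick probe can settle what a given toolchain actually does. This is only a minimal sketch; the helper names plain_char_is_signed and wchar_bits are made up for illustration.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Hypothetical helper: true when plain char behaves as a signed type
// on this compiler/target.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Hypothetical helper: width of wchar_t in bits on this toolchain.
constexpr int wchar_bits() {
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}
```

A build could use these in a static_assert to refuse to compile when its assumptions about wchar_t do not hold.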

Important thing: one char -> one ASCII character, one wchar_t -> one
Unicode character. From the ANSI/ISO point of view multibyte encodings
don't fit in this model, since otherwise strlen/wcslen and similar
wouldn't count characters correctly.
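To make the counting problem concrete, here is a small sketch that counts code points in a UTF-8 string by skipping continuation bytes. The function name count_code_points is hypothetical, and the input is assumed to be valid UTF-8 (no validation is done).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: counts code points, not bytes.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For "café" encoded as UTF-8 ("caf\xC3\xA9"), strlen reports 5 bytes while the function above reports 4 code points.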

In C++ the strings should be smart enough to 'auto specialize'
themselves to the passed type (std::string and std::wstring are
typedefs or such for some std::basic_string<> template trickery which
I refuse to understand)

Then the external representation: Linux uses UTF-8, OS X uses UTF-8
(with a different canonicalization rule), Windows (last time I checked)
UTF-16.

Strictly speaking UTF-8 should be stored as uint8_t sequences but since
it's so clever it's mostly compatible with ISO 646, so it can be passed
directly to syscalls.

In fact the Unix kernel doesn't care about the encoding of filenames
most of the time (on a Unix FS); the only important characters are NUL
(the terminator) and / (the path separator). You can make a file named
with BELs and ls should happily beep when printing its name (unless
it's a smart ls like the current ones, which escape unprintables)

On Windows IIRC there are two families of functions, the famous A and
W, which get compiled in depending on UNICODE. Also there is WCHAR,
which may or may not be a wchar_t (in Microsoft's defense, wchar_t
wasn't invented yet at the time).

The 'correct' way to handle it would be: process *everything* in core
as wchar_t; but then there is NO WAY to portably pass them to the OS on
both SUS and Windows!
The Unix way would be to convert from wchar_t, using the locale
functions (yes, NO ONE said that filenames are UTF-8, you could encode
them as ISO 2022 if the locale was set in such a way) to get a uint8_t
representation, and then pass it to open(2) to get a handle.
The Windows way would be to convert from wchar_t to WCHAR with some SDK
function (maybe they are the same, after all, but under gcc I don't
think so) and then pass it to the W variant of the open call
(CreateFileW, or _wfopen in the CRT, IIRC) to get a handle.


Inside the file it's no man's land, but KiCad standardized on UTF-8
without a BOM, so that would be it.

IN PRACTICE almost nobody uses wchar_t, and Unix standardized on UTF-8
on the inside (at least GTK, explicitly) because wchar_t is huge and
most of the libraries around take char*

Hell ensues since developers forget that these char* are not 'simple'
characters and you need to special-case every single system call;
there's no way around it.

Also on I/O the reports from KiCad don't line up correctly because of
the same nature of UTF-8 (one byte != one output cell; in fact there
are characters which take TWO output cells!)
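A rough sketch of what column alignment has to do instead of counting bytes: decode the code points and sum an estimated cell width for each. Everything here is an assumption for illustration: the names display_width and pad_to are made up, the input is assumed to be valid UTF-8, and the 'wide' ranges are a crude approximation, NOT the full Unicode East Asian Width data (a real program would use wcwidth or a Unicode library).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Assumes s holds valid UTF-8; no error handling.
static char32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    int extra = 0;
    char32_t cp = lead;
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Crude estimate of terminal cells per code point (approximate ranges only).
static int cell_width(char32_t cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return 0;   // combining marks
    if ((cp >= 0x1100 && cp <= 0x115F) ||         // Hangul Jamo
        (cp >= 0x2E80 && cp <= 0xA4CF) ||         // CJK, kana and friends
        (cp >= 0xAC00 && cp <= 0xD7A3) ||         // Hangul syllables
        (cp >= 0xF900 && cp <= 0xFAFF) ||         // CJK compatibility
        (cp >= 0xFF00 && cp <= 0xFF60)) return 2; // fullwidth forms
    return 1;
}

// Hypothetical helper: estimated number of output cells for a UTF-8 string.
std::size_t display_width(const std::string& utf8) {
    std::size_t width = 0;
    std::size_t i = 0;
    while (i < utf8.size()) {
        width += cell_width(next_code_point(utf8, i));
    }
    return width;
}

// Hypothetical helper: pad with spaces up to a given number of cells.
std::string pad_to(const std::string& utf8, std::size_t cells) {
    std::size_t w = display_width(utf8);
    return w >= cells ? utf8 : utf8 + std::string(cells - w, ' ');
}
```

With this, the two kanji in "\xE6\x97\xA5\xE6\x9C\xAC" are 6 bytes and 2 code points, but are estimated at 4 cells, which is the number a report column needs.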

wx is guilty because of the build option itself. Opencascade has the
problem already noticed (it fails to special-case MinGW). *Every*
library which takes a filename (libc is not exempt; in fact on Windows
there is tchar.h!) needs to take the encoding into account; it's simply
not correct to pass it to fopen or something.
To treat Unicode correctly you need huge libraries: as hinted before, a
valid UTF-8 filename under Linux is not so under OS X since the
canonicalization is different: Linux wants precomposed characters, OS X
is designed around decomposed characters... you could end up with two
files with the SAME NAME but different encodings, and who knows how
many more subtle issues...
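The precomposed/decomposed trap is easy to show with raw bytes. In this sketch (the constant and function names are made up for illustration) both strings display as the same name, "café.txt", but one uses U+00E9 and the other 'e' followed by the combining acute accent U+0301.

```cpp
#include <cassert>
#include <string>

// Precomposed form: U+00E9 encoded as C3 A9.
const std::string kPrecomposed = "caf\xC3\xA9.txt";

// Decomposed form: 'e' followed by U+0301, encoded as 65 CC 81.
const std::string kDecomposed = "cafe\xCC\x81.txt";

// Hypothetical helper: a plain byte comparison, which is all that a
// normalization-unaware filesystem or strcmp will ever do.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

A byte-wise comparison sees two different names, so without a normalization step (which needs real Unicode tables, hence the huge libraries) the two can coexist or fail to match each other.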


My personal practice for handling filenames is to relegate the opening
to one place and then pass around the handles (or what? FILE*?
fstream&? tough decision). Also notice that if a function in the end
uses the A variant of the open call under Windows, there is NO WAY it
could handle an extended name.
Good luck with that
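A minimal sketch of the 'open in one place' idea, assuming the rest of the program keeps paths as UTF-8 in std::string. The names open_utf8, FilePtr and utf8_to_wide are hypothetical; the Windows branch relies on MultiByteToWideChar and _wfopen, the other branch just trusts the byte string as on Unix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Closes the FILE* automatically when the handle goes out of scope.
struct FileCloser {
    void operator()(std::FILE* f) const { if (f) std::fclose(f); }
};
using FilePtr = std::unique_ptr<std::FILE, FileCloser>;

#ifdef _WIN32
// Convert UTF-8 to UTF-16 for the W family of functions.
static std::wstring utf8_to_wide(const std::string& s) {
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
    w.resize(static_cast<std::size_t>(n) - 1);  // drop the trailing NUL
    return w;
}
#endif

// The single place where a path is turned into a handle.
// Returns an empty FilePtr if the file cannot be opened.
FilePtr open_utf8(const std::string& utf8_path, const char* mode) {
#ifdef _WIN32
    std::wstring wpath = utf8_to_wide(utf8_path);
    std::wstring wmode = utf8_to_wide(mode);
    return FilePtr(_wfopen(wpath.c_str(), wmode.c_str()));
#else
    return FilePtr(std::fopen(utf8_path.c_str(), mode));
#endif
}
```

Everything downstream then takes a FilePtr (or the raw FILE* from get()) and never sees a filename, so the platform special-casing lives in exactly one function.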

Have fun