Re: charset mess inside source files
Jussi Pakkanen wrote:
On Sun, Aug 24, 2008 at 5:57 AM, Alex Samorukov <samm@xxxxxxxxxxx> wrote:
While hacking on image support I found that some source files use
different character sets. Most of the files are encoded in
windows-1251, while some of them use UTF-8. This makes my debugger and
editor very unhappy.
These are in the ced and ccom directories and were caused by the
translation patch. What errors does this cause? AFAICR only the
comments use non-US-ASCII characters, and all tools I know of ignore
those anyway. Should you get problems, you can manually set your
editor's encoding to ISO-LATIN-1 (8859-1); this is what I do in
Eclipse.
I have a problem with jEdit, the Java editor I am using. If I set its
default character set to Unicode, global search (the search-directory
command) does not work on such files. I think it is a jEdit bug, but I
also think it is better not to mix character sets within one project
anyway. We should also be very careful when converting the source code
to another codepage: some files contain non-escaped high-ASCII
characters, so converting them (without patching) may cause problems.
And about character sets: I had some time to trace the code. Internally
it uses 8-bit ANSI (Windows) character sets, and the engine provides a
way to set the output character set for some formats.
You can specify the character set via the third argument of the
PUMA_XSave function (we are wrongly passing 0 in the cli); a call
sketch follows the constants below.
The currently defined constants are:
# define PUMA_CODE_UNKNOWN 0x0000
# define PUMA_CODE_ASCII 0x0001
# define PUMA_CODE_ANSI 0x0002
# define PUMA_CODE_KOI8 0x0004
# define PUMA_CODE_ISO 0x0008
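For example, the cli could pass an explicit code like this (a sketch
only; the file name and the PUMA_TOTEXT format argument are my
assumptions, the message above only confirms that the third argument
selects the codepage):

/* Sketch: save with an explicit output codepage instead of 0. */
if (!PUMA_XSave("output.txt", PUMA_TOTEXT, PUMA_CODE_ANSI))
    /* handle the save error */ ;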
If you pass PUMA_CODE_ANSI it will not convert anything; the other
codes trigger an 8-bit table conversion inside rout.cpp (a generic
sketch of such a conversion follows the cp_ansi table below). All
tables are stored in rout/src/codetables.cpp, which also contains the
codepage numbers for the different languages (yes, they differ):
static long cp_ansi[LANG_TOTAL]={
1251, // LANG_ENGLISH 0
1251, // LANG_GERMAN 1
1251, // LANG_FRENCH 2
1251, // LANG_RUSSIAN 3
1251, // LANG_SWEDISH 4
1251, // LANG_SPANISH 5
1251, // LANG_ITALIAN 6
1251, // LANG_RUS/ENG 7
1251, // LANG_UKRAINIAN 8
1251, // LANG_SERBIAN 9
1250, // LANG_CROATIAN 10
1250, // LANG_POLISH 11
1251, // LANG_DANISH 12
1251, // LANG_PORTUGUESE 13
1251, // LANG_DUTCH 14
1251, // LANG_DIG 15
1251, // LANG_UZBEK 16
1251, // LANG_KAZ 17
1251, // LANG_KAZ_ENG 18
1250, // LANG_CZECH 19 // 05.09.2000 E.P.
1250, // LANG_ROMAN 20
1250, // LANG_HUNGAR 21
1251, // LANG_BULGAR 22
1250, // LANG_SLOVENIAN 23 // 25.05.2001 E.P.
1257, // LANG_LATVIAN 24
1257, // LANG_LITHUANIAN 25
1257, // LANG_ESTONIAN 26
1254 // LANG_TURKISH 27
};
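For reference, such an 8-bit table conversion boils down to remapping
the high half of the byte range. A generic sketch (the table name and
contents here are placeholders, not the real tables from
codetables.cpp):

/* Bytes below 0x80 pass through, the rest go through a 128-entry
   table for the target codepage (placeholder values). */
static const unsigned char hi_map[128] = { 0 /* real values live in codetables.cpp */ };

static void convert_8bit(unsigned char *s)
{
    for (; *s; ++s)
        if (*s >= 0x80)
            *s = hi_map[*s - 0x80];
}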
To get the active codepage there is the GetCodePage() function; to get
the list of available codepages there is ROUT_ListCodes(PWord8 buf,
ULONG sizeBuf). The resulting list depends on the selected output
format. It also calls the unimplemented LoadString function to return
the charset description.
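A usage sketch (the Word8 type, the return conventions and the meaning
of the buffer contents are my assumptions; only the two signatures
above are confirmed):

Word8 codes[32];
long  active = GetCodePage();               /* currently active codepage */
if (ROUT_ListCodes(codes, sizeof(codes))) {
    /* codes[] now lists what the selected output format supports */
}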
So, my proposal for UTF-8 support in the engine is:
1) Define PUMA_CODE_UTF8 in the headers.
2) Add a function like from_ansi_to_utf8() to codetables.cpp. Place the
iconv code there inside an
#ifdef HAVE_ICONV and HAVE_UTF8
#endif
block and return an error if iconv() is absent. Later it should be very
easy to add WinNLS code here for the win32 engine, so porting will not
be a hard task. from_ansi_to_utf8() will take the defined PUMA_* codes
as arguments, so even the charset names will not be system dependent.
(A sketch of this function follows the list below.)
3) Modify ROUT_ListCodes to also return UTF-8 when HAVE_UTF8 is
defined. Add a call to from_ansi_to_utf8() in the rout library when
PUMA_CODE_UTF8 is selected and working.
4) Modify the cli utility to call ROUT_ListCodes and use PUMA_CODE_UTF8
when it is available.
5) Add a check for iconv to the cmake scripts, and define HAVE_UTF8 and
HAVE_ICONV there. On Windows it could just set HAVE_WINNLS and
HAVE_UTF8, so the only system-dependent code will be in the
from_ansi_to_utf8() function.
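To make step 2 a bit more concrete, here is a minimal sketch of what
from_ansi_to_utf8() could look like on top of iconv. The iconv charset
names, the buffer handling and the return convention are my guesses,
and a real version would probably pick the source codepage per
language from cp_ansi[] instead of hardcoding it:

#include <stddef.h>
#include <string.h>
#ifdef HAVE_ICONV
#include <iconv.h>

/* Map a PUMA_CODE_* value (from the headers above) to an iconv
   charset name. The concrete names are assumptions; CP1251 is taken
   from the cp_ansi table. */
static const char *puma_code_to_iconv(int code)
{
    switch (code) {
    case PUMA_CODE_ASCII: return "ASCII";
    case PUMA_CODE_ANSI:  return "CP1251";
    case PUMA_CODE_KOI8:  return "KOI8-R";
    case PUMA_CODE_ISO:   return "ISO-8859-5";
    default:              return NULL;
    }
}

/* Convert src (in the 8-bit codepage selected by code) to UTF-8 in
   dst. Returns 0 on success, -1 on error. */
int from_ansi_to_utf8(int code, const char *src, char *dst, size_t dstsize)
{
    const char *name = puma_code_to_iconv(code);
    if (!name)
        return -1;

    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return -1;

    char  *in      = (char *)src;
    size_t inleft  = strlen(src);
    char  *out     = dst;
    size_t outleft = dstsize - 1;

    size_t r = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;

    *out = '\0';
    return 0;
}
#else
/* Without iconv just report that UTF-8 output is not available. */
int from_ansi_to_utf8(int code, const char *src, char *dst, size_t dstsize)
{
    (void)code; (void)src; (void)dst; (void)dstsize;
    return -1;
}
#endif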
This will allow us to add Unicode support to the PUMA engine without
breaking the current API.
If you think this is the correct way, I could try to start on it.