Re: charset mess inside source files
Jussi Pakkanen wrote:
On Sun, Aug 24, 2008 at 5:57 AM, Alex Samorukov <samm@xxxxxxxxxxx> wrote:
While hacking on image support I found that some source files use
different character sets. Most of the files are encoded in
windows-1251, while some of them use UTF-8. This makes my debugger and
editor very unhappy.
These are in the ced and ccom directories and were caused by the
translation patch. What errors does this cause? AFAICR only the
comments use non-US-ASCII characters, and all tools I know of ignore
those anyway. Should you get problems, you can manually set your
editor's encoding to ISO-LATIN-1 (8859-1); this is what I do in
Eclipse.
I have a problem with jEdit, the Java editor I am using. If I set its
default character set to Unicode, global search (the search-directory
command) does not work on such files. I think it is a jEdit bug, but I
also think it is better not to mix character sets within one project
anyway. We should also be very careful when converting the source code
to another codepage: some files contain non-escaped high-ASCII
characters, so converting them (without patching) may cause problems.
And about character sets: I had some time to trace the code. Internally
it uses 8-bit ANSI (Windows) character sets, and the engine provides a
way to set the output character set for some formats.
You can specify the character set via the third argument of the
PUMA_XSave function (we are wrongly passing 0 in the cli); a call
sketch follows the constants below.
The currently defined constants are:
# define PUMA_CODE_UNKNOWN 0x0000
# define PUMA_CODE_ASCII 0x0001
# define PUMA_CODE_ANSI 0x0002
# define PUMA_CODE_KOI8 0x0004
# define PUMA_CODE_ISO 0x0008
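For example, the cli could pass an explicit code like this (a sketch
only; the file name and the PUMA_TOTEXT format argument are my
assumptions, the message above only confirms that the third argument
selects the codepage):

/* Sketch: save with an explicit output codepage instead of 0. */
if (!PUMA_XSave("output.txt", PUMA_TOTEXT, PUMA_CODE_ANSI))
    /* handle the save error */ ;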
If you pass PUMA_CODE_ANSI it will not convert anything; the other
codes trigger an 8-bit table conversion inside rout.cpp (a generic
sketch of such a conversion follows the cp_ansi table below). All
tables are stored in rout/src/codetables.cpp, which also contains the
codepage numbers for the different languages (yes, they differ):
static long cp_ansi[LANG_TOTAL]={
1251, // LANG_ENGLISH 0
1251, // LANG_GERMAN 1
1251, // LANG_FRENCH 2
1251, // LANG_RUSSIAN 3
1251, // LANG_SWEDISH 4
1251, // LANG_SPANISH 5
1251, // LANG_ITALIAN 6
1251, // LANG_RUS/ENG 7
1251, // LANG_UKRAINIAN 8
1251, // LANG_SERBIAN 9
1250, // LANG_CROATIAN 10
1250, // LANG_POLISH 11
1251, // LANG_DANISH 12
1251, // LANG_PORTUGUESE 13
1251, // LANG_DUTCH 14
1251, // LANG_DIG 15
1251, // LANG_UZBEK 16
1251, // LANG_KAZ 17
1251, // LANG_KAZ_ENG 18
1250, // LANG_CZECH 19 // 05.09.2000 E.P.
1250, // LANG_ROMAN 20
1250, // LANG_HUNGAR 21
1251, // LANG_BULGAR 22
1250, // LANG_SLOVENIAN 23 // 25.05.2001 E.P.
1257, // LANG_LATVIAN 24
1257, // LANG_LITHUANIAN 25
1257, // LANG_ESTONIAN 26
1254 // LANG_TURKISH 27
};
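For reference, such an 8-bit table conversion boils down to remapping
the high half of the byte range. A generic sketch (the table name and
contents here are placeholders, not the real tables from
codetables.cpp):

/* Bytes below 0x80 pass through, the rest go through a 128-entry
   table for the target codepage (placeholder values). */
static const unsigned char hi_map[128] = { 0 /* real values live in codetables.cpp */ };

static void convert_8bit(unsigned char *s)
{
    for (; *s; ++s)
        if (*s >= 0x80)
            *s = hi_map[*s - 0x80];
}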
To get the active codepage there is the GetCodePage() function; to get
the list of available codepages there is ROUT_ListCodes(PWord8 buf,
ULONG sizeBuf). The resulting list depends on the selected output
format. It also calls the unimplemented LoadString function to return
the charset description.
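A usage sketch (the Word8 type, the return conventions and the meaning
of the buffer contents are my assumptions; only the two signatures
above are confirmed):

Word8 codes[32];
long  active = GetCodePage();               /* currently active codepage */
if (ROUT_ListCodes(codes, sizeof(codes))) {
    /* codes[] now lists what the selected output format supports */
}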
So, my proposal for UTF-8 support in the engine is:
1) Define PUMA_CODE_UTF8 in the headers.
2) Add a function like from_ansi_to_utf8() to codetables.cpp. Place the
iconv code there inside an
#ifdef HAVE_ICONV and HAVE_UTF8
#endif
block and return an error if iconv() is absent. Later it should be very
easy to add WinNLS code here for the win32 engine, so porting will not
be a hard task. from_ansi_to_utf8() will take the defined PUMA_* codes
as arguments, so even the charset names will not be system dependent.
(A sketch of this function follows the list below.)
3) Modify ROUT_ListCodes to also return UTF-8 when HAVE_UTF8 is
defined. Add a call to from_ansi_to_utf8() in the rout library when
PUMA_CODE_UTF8 is selected and working.
4) Modify the cli utility to call ROUT_ListCodes and use PUMA_CODE_UTF8
when it is available.
5) Add a check for iconv to the cmake scripts, and define HAVE_UTF8 and
HAVE_ICONV there. On Windows it could just set HAVE_WINNLS and
HAVE_UTF8, so the only system-dependent code will be in the
from_ansi_to_utf8() function.
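To make step 2 a bit more concrete, here is a minimal sketch of what
from_ansi_to_utf8() could look like on top of iconv. The iconv charset
names, the buffer handling and the return convention are my guesses,
and a real version would probably pick the source codepage per
language from cp_ansi[] instead of hardcoding it:

#include <stddef.h>
#include <string.h>
#ifdef HAVE_ICONV
#include <iconv.h>

/* Map a PUMA_CODE_* value (from the headers above) to an iconv
   charset name. The concrete names are assumptions; CP1251 is taken
   from the cp_ansi table. */
static const char *puma_code_to_iconv(int code)
{
    switch (code) {
    case PUMA_CODE_ASCII: return "ASCII";
    case PUMA_CODE_ANSI:  return "CP1251";
    case PUMA_CODE_KOI8:  return "KOI8-R";
    case PUMA_CODE_ISO:   return "ISO-8859-5";
    default:              return NULL;
    }
}

/* Convert src (in the 8-bit codepage selected by code) to UTF-8 in
   dst. Returns 0 on success, -1 on error. */
int from_ansi_to_utf8(int code, const char *src, char *dst, size_t dstsize)
{
    const char *name = puma_code_to_iconv(code);
    if (!name)
        return -1;

    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return -1;

    char  *in      = (char *)src;
    size_t inleft  = strlen(src);
    char  *out     = dst;
    size_t outleft = dstsize - 1;

    size_t r = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;

    *out = '\0';
    return 0;
}
#else
/* Without iconv just report that UTF-8 output is not available. */
int from_ansi_to_utf8(int code, const char *src, char *dst, size_t dstsize)
{
    (void)code; (void)src; (void)dst; (void)dstsize;
    return -1;
}
#endif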
This will allow us to add Unicode support to the PUMA engine without
breaking the current API.
If you think this is the correct way, I could try to start on it.