cuneiform team mailing list archive
Message #00294
Re: Planning a new release
On Tue, May 12, 2009 at 8:56 AM, Dmitry Polevoy
<openocr.polevoy@xxxxxxxxx> wrote:
> The main problem with UTF-8 source code is that Cyrillic characters are used
> in the source code, and conversion can break the program. We should build a
> strong test system to check OCR results before we start experiments with
> conversion. At present, compatibility with Visual Studio 6 is important, and
> this is the second reason not to convert the source code.
There are two different character set questions here. One is the
character set that Cuneiform uses internally. We don't change that.
The dat files are not touched, no functionality is changed. The other
encoding is the one used in the source files, specifically in the
comments. Merely converting comments is not a problem: You just
compute md5sums of the output files before and after the conversion.
If they are the same (and they should be), you can be certain that
nothing has broken.
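
The before/after check described above could be scripted roughly as follows. This is only a sketch: the helper names `md5_of_files` and `changed_files` are made up for illustration, and which files count as "output files" (build outputs, for example) is left to the caller.

```python
import hashlib
from pathlib import Path


def md5_of_files(paths):
    """Map each path (as a string) to the hex MD5 digest of its contents."""
    return {str(p): hashlib.md5(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(before, after):
    """Return sorted paths whose digest differs, or that exist on one side only."""
    return sorted(
        p for p in before.keys() | after.keys()
        if before.get(p) != after.get(p)
    )
```

Record `md5_of_files(...)` once before the conversion and once after a rebuild; an empty list from `changed_files` means the outputs are byte-for-byte identical.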
There may be an issue with character values > 127 that are in the
code. Probably they will show up as the Unicode "unknown symbol"
character, but since C compilers ignore encodings and just treat files
as a byte stream, it should still work. If this causes problems, we can
replace the problematic values with proper character escape codes.
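
If it comes to that, the replacement could be generated mechanically. The sketch below (the function name `escape_high_bytes` is hypothetical) rewrites every byte above 127 as a three-digit octal escape, so the compiled byte values stay the same. Octal is used rather than hex because a C hex escape swallows any hex digits that follow it, while an octal escape stops after three digits. It assumes the input is the raw content of a string or character literal, not a whole source file.

```python
def escape_high_bytes(data: bytes) -> bytes:
    """Replace each byte > 127 with a three-digit C octal escape such as \\300."""
    out = bytearray()
    for b in data:
        if b > 127:
            out += b"\\%03o" % b  # e.g. 0xC0 becomes the four bytes \300
        else:
            out.append(b)
    return bytes(out)
```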
> Is source code in win-1251 really so bad? (I lack experience in this
> field)
The main problem is that it looks very ugly. Here is a random sample:
/********** Çàãîëîâîê *******************************************************/
/* Àâòîð, */
/* êîììåíòàðèè */
/* è äàëíåéøàÿ */
/* ïðàâêà : Àëåêñåé Êîíîïëåâ */
/* Ðåäàêöèÿ : 08.06.00 */
/* Ôàéë : 'Normalise.cpp' */
/* Ñîäåðæàíèå : Íîðìàëèçàöèÿ ñûðüÿ */
/* Íàçíà÷åíèå : */
/*----------------------------------------------------------------------------*/
If you could manually set your text editor widget to use win-1251 then
this would be less of an issue. But at least Eclipse can't do that.
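
For reference, the garbling in the sample is reversible. Assuming it comes from win-1251 bytes being displayed as Latin-1 (my guess at what the editor does), a sketch like this recovers the Cyrillic text and re-encodes file contents as UTF-8. Both function names are made up for illustration.

```python
def unmangle(text: str) -> str:
    """Recover Cyrillic from win-1251 bytes that were displayed as Latin-1."""
    return text.encode("latin-1").decode("cp1251")


def cp1251_to_utf8(raw: bytes) -> bytes:
    """Re-encode win-1251 file contents as UTF-8."""
    return raw.decode("cp1251").encode("utf-8")
```

Feeding the first word of the sample header to `unmangle` gives back the Russian word for "heading".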
> Non-source files (like read.me, license and so on) can be converted to UTF-8
> if required.
The latest checkin does this already as an experiment. You can try
copying license.txt to test.c or something, opening it with VS 6 and
telling us whether it works for you.