
cuneiform team mailing list archive

Re: Hocr and html - new version

 

On Wed, Oct 7, 2009 at 11:23 AM, julien <julien@xxxxxxxxxxxxxxxxxxx> wrote:

> The hocr/html output now passes the W3C validation.

Excellent.

> Regarding the russian comments that came out wrong:
>
> I have fixed the comments. First time so probably good if someone could quickly skim and see if it seems alright.

It still happens:

-    // ?????? ?? ??? ?????????
+    // ������ �� ��� ��������?

You can get the difference from cf-linux trunk to your Launchpad
branch with this command:

bzr diff --new lp:~julien-student/cuneiform-linux/hocroutput --old lp:cuneiform-linux

> I used:  iconv -f cp1251 -t utf8
> on the original file, then copied in all comments, and then reversed: iconv -f utf8 -t cp1251
> so now the file should be in cp1251.

Please do not do this. Seriously. You will break stuff, probably
silently and devilishly. I'll quote your other message here to keep
the discussion coherent.

> Could we take a decision to go to utf8?
>
> Imagine postponing it longer, having more branches created, more messed up comments, maybe someone changing their
> encoding.. would be very difficult then to start pull/push between branches. Better start early and avoid the issue.

If it were up to me, we would already be on UTF-8. Unfortunately we cannot change.

Firstly, the code uses 8-bit characters that must not be converted.
Such as this:

char *somelist = "öä%...etc.etc...";

Converting would mean creating a program that separates code from
comments, converts the latter with iconv and the former with proper C
escape characters. This is equivalent to writing a full C++ parser. If
such a program exists, great. Otherwise we are stuck with the current
mess.
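
To make the first point concrete, here is a minimal sketch in Python (not
part of Cuneiform) of what a whole-file conversion does to an 8-bit string
literal. The helper name reencode and the three-letter table are made up
for illustration; cp1251 is simply the encoding mentioned above.

```python
# Sketch only: shows how re-encoding a whole file changes the bytes
# inside a string literal. Names and data here are hypothetical.

def reencode(data: bytes, src: str = "cp1251", dst: str = "utf-8") -> bytes:
    """Roughly what `iconv -f cp1251 -t utf8` does to a run of bytes."""
    return data.decode(src).encode(dst)

# A hypothetical lookup table as it would sit inside a C string literal:
# three Cyrillic letters (U+0430..U+0432), one byte each in cp1251.
legacy_literal = "\u0430\u0431\u0432".encode("cp1251")
converted_literal = reencode(legacy_literal)

print(len(legacy_literal))     # 3: one byte per character
print(len(converted_literal))  # 6: two bytes per character in UTF-8

# Code that indexes the table by byte now reads a different value.
print(legacy_literal[1:2] == converted_literal[1:2])  # False

# Pure ASCII code is unaffected, which is why the damage is easy to miss.
print(reencode(b"char *p;") == b"char *p;")  # True
```

Any C code that treats such a literal as a table of single-byte characters
would silently read different data after the conversion, while the rest of
the file still compiles.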

Secondly, the Cognitive people are working on the openocr.org version of
Cuneiform. We want to eventually merge with them. Changing the
encoding (or fixing the indentation or pretty much anything major,
really) makes this almost impossible.

So unfortunately the solution currently is to either get used to
looking at Unicode replacement characters or change your editor's
encoding for this project. Sorry.


