cuneiform team mailing list archive

Re: Hocr and html - new version

From: julien <julien@xxxxxxxxxxxxxxxxxxx>
Date: Wed, 7 Oct 2009 11:05:53 +0000
Accept-language: sv-SE, en-US
Cc: "cuneiform@xxxxxxxxxxxxxxxxxxx" <cuneiform@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <42d23b2e0910070216h287a9047lfb3e1ad2134f70e7@mail.gmail.com>
Thread-index: AQHKRy7p80rz0uDCIUiF2dt3APbJMJD57eeK
Thread-topic: [Cuneiform] Hocr and html - new version

I see. Thanks for the info.

I have pushed a new revision and gone through the bzr diff. It seems OK now.
(Note that some of the garbled "+// ..." comment lines come from the first version of the ocr_line support, which was written by Dmitry.)

Regards
Julien

________________________________________
From: Jussi Pakkanen [jpakkane@xxxxxxxxx]
Sent: 7 October 2009 11:16
To: julien
Cc: cuneiform@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Cuneiform] Hocr and html - new version

On Wed, Oct 7, 2009 at 11:23 AM, julien <julien@xxxxxxxxxxxxxxxxxxx> wrote:

> The hocr/html output now passes the W3C validation.

Excellent.

> Regarding the Russian comments that came out wrong:
>
> I have fixed the comments. This is my first time doing this, so it would be good if someone could quickly skim them and check that they look all right.

It still happens:

-    // ?????? ?? ??? ?????????
+    // ������ �� ��� ��������?

You can get the difference from cf-linux trunk to your Launchpad
branch with this command:

bzr diff --new lp:~julien-student/cuneiform-linux/hocroutput --old lp:cuneiform-linux

> I used:  iconv -f cp1251 -t utf8
> on the original file, then copied in all comments, and then reversed: iconv -f utf8 -t cp1251
> so now the file should be in cp1251.

Please do not do this. Seriously. You will break stuff, probably
silently and devilishly. I'll quote your other message here to keep
the discussion coherent.

> Could we take a decision to go to utf8?
>
> Imagine postponing it longer: more branches get created, more comments get messed up, and maybe someone changes their
> encoding. It would then be very difficult to start pulling and pushing between branches. Better to start early and avoid the issue.

If it were up to me, we would already be using UTF-8. Unfortunately we cannot change.

Firstly, the code uses 8-bit characters that must not be converted.
Such as this:

char *somelist = "öä%...etc.etc...";

Converting would mean creating a program that separates code from
comments, converts the latter with iconv and rewrites the former with
proper C escape sequences. This is equivalent to writing a full C++
parser. If such a program exists, great. Otherwise we are stuck with
the current mess.
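
To make the first point concrete, here is a minimal Python sketch. It is
not part of Cuneiform: the sample bytes and the helper name
to_c_octal_escapes are hypothetical, and it assumes the literal above
holds the Latin-1 bytes for "öä". It shows that a blanket cp1251 to UTF-8
conversion changes the bytes inside a string literal, while rewriting the
high bytes as C octal escapes keeps them intact.

```python
# Hypothetical sample: the two bytes of "öä" as stored in an 8-bit
# (Latin-1) source file.
literal_bytes = b"\xf6\xe4"

# What a cp1251 -> UTF-8 conversion does to those bytes: they are read as
# Cyrillic letters and re-encoded, so the literal silently changes.
converted = literal_bytes.decode("cp1251").encode("utf-8")
print(converted)  # b'\xd1\x86\xd0\xb4' (four bytes instead of two)


def to_c_octal_escapes(raw: bytes) -> str:
    """Rewrite bytes >= 0x80 as 3-digit octal escapes; ASCII is left as-is.

    Octal is used because a C hex escape has no fixed length and would
    absorb any hex digits that follow it.
    """
    return "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in raw)


print(to_c_octal_escapes(literal_bytes))  # \366\344
```

The sketch only handles bytes already known to belong to a string
literal. Finding those bytes in real source files is the part that needs
the parser described above.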

Secondly, the Cognitive people are working on the openocr.org version of
Cuneiform. We want to eventually merge with them. Changing the
encoding (or fixing the indentation or pretty much anything major,
really) makes this almost impossible.

So unfortunately the solution for now is either to get used to
looking at Unicode replacement characters or to change your editor's
encoding for this project. Sorry.
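
For reading the comments without touching the files, a read-only decode
is enough. The following Python sketch is only an illustration: the
helper name decode_for_display and the sample bytes are hypothetical,
and it assumes the comments are stored as cp1251.

```python
def decode_for_display(raw: bytes) -> str:
    """Decode cp1251 bytes for viewing only; nothing is written back."""
    # errors="replace" turns the one byte cp1251 leaves undefined (0x98)
    # into U+FFFD instead of raising an exception.
    return raw.decode("cp1251", errors="replace")


# Hypothetical comment bytes as they might appear in a cp1251 source file.
sample = b"// \xef\xf0\xe8\xec\xe5\xf0"
print(decode_for_display(sample))  # // пример
```

Because the bytes on disk stay unchanged, this does not affect diffs or
later merges.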
