cuneiform team mailing list archive

[Bug 388926] Re: Lithuanian text recognition: wrong recognition of "ų" as an "ę"

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: Ben Jackson <ben@xxxxxxx>
Date: Mon, 06 Jul 2009 03:10:21 -0000
Reply-to: Bug 388926 <388926@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Assuming there is a Lithuanian dictionary (or you create a 'user
dictionary', which I am almost done adding support for), I believe
the key to making this work is to create a suitable
datafiles/rec9lit.dat entry which tells spelart.c that "ų" and "ę" are
sometimes confused for each other. This is the same list that knows
that 'rn' looks like 'm' and 'vv' looks like 'w'. The current
rec9lit.dat is just a copy of the default English one, so it does not
list this pair.
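
To illustrate the idea (this is NOT how spelart.c is written, and it
does not reflect the binary layout of rec9lit.dat), here is a minimal
Python sketch of a confusion-pair spelling check. The table contents,
the function names and the one-swap-at-a-time policy are all
assumptions made for the example.

```python
# Hypothetical confusion table. It only mirrors the pairs mentioned in this
# message; it is not the real contents or format of rec9lit.dat.
CONFUSABLE = [
    ("\u0173", "\u0119"),  # "ų" (u with ogonek) <-> "ę" (e with ogonek)
    ("rn", "m"),
    ("vv", "w"),
]


def single_swap_variants(word, pairs=CONFUSABLE):
    """Return the word plus every spelling that differs by one confusable swap."""
    out = {word}
    for a, b in pairs:
        # Each pair is tried in both directions.
        for src, dst in ((a, b), (b, a)):
            pos = word.find(src)
            while pos != -1:
                out.add(word[:pos] + dst + word[pos + len(src):])
                pos = word.find(src, pos + 1)
    return out


def correct(word, dictionary):
    """Keep a known word; otherwise accept a variant only if exactly one is known."""
    if word in dictionary:
        return word
    hits = sorted(v for v in single_swap_variants(word) if v in dictionary)
    return hits[0] if len(hits) == 1 else word
```

With a dictionary that contains "namų" but not "namę", calling
correct("namę", dictionary) would return "namų". Without a dictionary
entry for the intended word, the pair alone changes nothing, which is
why the dictionary matters here.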

Here are some notes as I investigate.

0) I don't understand the source image: It seems to be a screenshot of
a web browser showing the *bad* output? It's not the thing to OCR, is
it??

1) rec6lit.dat defines the Lithuanian alphabet (essentially the char to
BYTE mapping). The "6" files are the alphabet files, and "lit" is the
abbreviation for Lithuanian.

2) Based on the contents of rec6lit.dat and *no* knowledge of
Lithuanian at all, my conclusion is that the charset of that file is
cp1257. That's consistent with mentions of 1257 in the code. (This
picture was useful: http://www.borgendale.com/codepage/cp1257.gif )

3) ...in fact, all of the internal string representations of BYTE seem
to be cp1257

4) (There's a bug in InitializeAlphabet where it uses a global instead
of the passed-in arg, which was breaking my dictionary builder! It does
not need to be fixed directly for this problem, though.)
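
For anyone who wants to check notes 2 and 3 themselves, here is a small
Python sketch of moving text in and out of cp1257 using Python's
built-in "cp1257" codec. The helper names are made up for the example
and nothing here comes from the cuneiform sources; it only assumes, as
above, that the internal BYTE strings really are cp1257.

```python
def to_internal(text):
    """Encode Unicode text to single-byte cp1257, the assumed internal form."""
    return text.encode("cp1257")


def from_internal(raw):
    """Decode cp1257 bytes back to Unicode text."""
    return raw.decode("cp1257")


if __name__ == "__main__":
    # "\u0173" is "ų" (u with ogonek), "\u0119" is "ę" (e with ogonek).
    for ch in ("\u0173", "\u0119"):
        raw = to_internal(ch)
        # Each letter should occupy exactly one byte in cp1257.
        print(ch, hex(raw[0]), len(raw))
```

Both letters map to one high byte each in this code page, so a
confusion pair for them is a pair of single BYTE values rather than a
multi-byte sequence.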

Ok, I have successfully made a modified rec9lit.dat and attached it (to
the bug). It tells the spelling code about your pair of letters. This
will cause it to try both variations against the stock dictionary and
any user dictionaries. I can see it is trying both even for your jpg
(which has the wrong letter, if I understand correctly). I don't know
if the dictionary that comes with cuneiform knows the words you are
having trouble with. If not, you will need my user dictionary support
as well. I'm still waiting for email about that to appear on the list
:(

** Attachment added: "tells spelart about a new letter transpose pair for lithuanian"
http://launchpadlibrarian.net/28723611/rec9lit.dat

--
Lithuanian text recognition: wrong recognition of "ų" as an "ę"
https://bugs.launchpad.net/bugs/388926
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: New

Bug description:
Using cuneiform 0.7 on Ubuntu 9.04

When OCRing a Lithuanian text with the switch "-l lit", a large number of the letters "ų" that usually come at the end of a word get recognized as "ę".

If someone pointed me to the source file I have to check, I am pretty certain that the solution is simple, as the mistake is very simple. However, I cannot find the file: the closest match - datafiles/*lit.dat are binary and I cannot edit those...

References

[Bug 388926] [NEW] Lithuanian text recognition: wrong recognition of "ų" as an "ę"
From: Donatas Glodenis, 2009-06-18