cuneiform team mailing list archive

[Bug 388926] Re: Lithuanian text recognition: wrong recognition of "ų" as an "ę"

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: Ben Jackson <ben@xxxxxxx>
Date: Wed, 08 Jul 2009 22:48:15 -0000
Reply-to: Bug 388926 <388926@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

About hex editing: Luckily this file has a very simple format, which was
easy to reverse-engineer by reading the source. Also, there are several
hardcoded entries in the table in spelart.c (plus some #ifdef'd-out ones
which are now in the rec9.dat file), so I could tell a lot about what
was in the file. The quotes you see are not really quotes; they're just
binary flags whose byte values happen to be ASCII quotes.

To edit the file I used Perl to read the first record (a special 14-byte
header) and add 1 to the count of entries. Then I copied out the
pre-existing entries and printed a new one which I constructed by hand
(Perl's 'pack' makes binary output easy). I will try to add some
comments to my script and add it to my bzr branch.
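
Until that script is posted, here is a rough sketch of the same steps in
Python rather than Perl. Only the 14-byte header size and the "add 1 to
the count" step come from the description above. The position, width and
byte order of the count field (COUNT_OFFSET, COUNT_FORMAT) are guesses
made for illustration, and the new entry is treated as opaque bytes
because the record layout is not described here.

```python
import struct

HEADER_SIZE = 14      # from the message: a special 14-byte header
COUNT_OFFSET = 0      # ASSUMPTION: where the entry count sits in the header
COUNT_FORMAT = "<H"   # ASSUMPTION: 16-bit little-endian unsigned count


def append_entry(data: bytes, new_entry: bytes) -> bytes:
    """Return a copy of `data` with the count bumped and `new_entry` appended."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file is shorter than the 14-byte header")

    header = bytearray(data[:HEADER_SIZE])
    existing_entries = data[HEADER_SIZE:]

    # Read the current count and write back count + 1 in place.
    (count,) = struct.unpack_from(COUNT_FORMAT, header, COUNT_OFFSET)
    struct.pack_into(COUNT_FORMAT, header, COUNT_OFFSET, count + 1)

    # Copy the old entries unchanged and put the hand-built one last.
    return bytes(header) + existing_entries + new_entry
```

A caller would read the dat file in binary mode, build the new record
with struct.pack (the counterpart of Perl's 'pack'), and write the
result to a new file rather than overwriting the original.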

About sources for the dat files: I have no idea if sources exist or if
the project will ever have them. I am only working with what's in the
repository (and only since last week!). The dat files could probably be
reverse-engineered so we could have editable versions in the repository
which get "compiled" at build time.

About dictionaries in general: There are two kinds of dictionaries:
the builtin ones (which must be pretty good for Lithuanian, otherwise
my patch would not work) and "user dictionaries". There are clear
functions to call to write new user dictionaries, and I am working on
support for that. The format appears to be different from that of the
"builtin" dictionaries. You could use several user dictionaries to do
what you want (probably not just one: there are size limits on user
dictionaries, and 300,000 words might not fit).
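
To illustrate the "several user dictionaries" idea, a word list could be
cut into pieces before each piece is handed to the dictionary writer.
This is only a sketch: the real size limit is not stated here and may be
measured in bytes rather than words, so MAX_WORDS_PER_DICT is a made-up
placeholder and split_word_list is a hypothetical helper, not part of
cuneiform.

```python
from typing import Iterable, Iterator, List

MAX_WORDS_PER_DICT = 50_000  # ASSUMPTION: placeholder, not cuneiform's real limit


def split_word_list(
    words: Iterable[str], max_words: int = MAX_WORDS_PER_DICT
) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `max_words` words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")

    chunk: List[str] = []
    for word in words:
        chunk.append(word)
        if len(chunk) == max_words:
            yield chunk
            chunk = []

    # Emit whatever is left over as a final, shorter chunk.
    if chunk:
        yield chunk
```

Each chunk would then become one user dictionary.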

About character encoding: The source has several functions for doing
*output* encoding, including UTF-8. However, internally all strings are
based on 8-bit BYTE arrays, so each language must be mapped onto some
8-bit encoding like cp1257. Every language has an alphabet (you can
easily decipher the rec6*.dat files) but unfortunately the charset is
not specified (it may be cp1257 for all of them). Your bug report made
me think about how to support that in my dictionary creation program. I
didn't see a good way to reuse existing code to allow UTF-8 input, so I
will probably require you to provide input in the charset already used
by cuneiform.
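
If the input does have to arrive in an 8-bit charset, a word list kept
in UTF-8 can be converted outside cuneiform first. The sketch below uses
Python's built-in "cp1257" codec. That cuneiform really uses cp1257 for
Lithuanian is only the guess made above, and encode_words is a
hypothetical helper name.

```python
from typing import Iterable, List

CHARSET = "cp1257"  # ASSUMPTION: the message only says it "may be cp1257"


def encode_words(words: Iterable[str], charset: str = CHARSET) -> List[bytes]:
    """Encode each word to single-byte text, failing loudly on bad input."""
    encoded = []
    for word in words:
        try:
            raw = word.encode(charset)
        except UnicodeEncodeError as exc:
            # A character outside the charset would silently corrupt a
            # dictionary entry, so stop instead of substituting "?".
            raise ValueError(
                f"{word!r} cannot be represented in {charset}"
            ) from exc
        encoded.append(raw)
    return encoded
```

With cp1257 each Lithuanian letter, including "ų" and "ę", becomes
exactly one byte, which is what code built on 8-bit BYTE arrays needs.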

--
Lithuanian text recognition: wrong recognition of "ų" as an "ę"
https://bugs.launchpad.net/bugs/388926
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: In Progress

Bug description:
Using cuneiform 0.7 on Ubuntu 9.04

When OCR-ing a Lithuanian text with the switch "-l lit", a large number of the letters "ų" that usually come at the end of a word get recognized as "ę".

If someone pointed me to the source file I have to check, I am pretty certain that the solution is simple, as the mistake is very simple. However, I cannot find the file: the closest matches, datafiles/*lit.dat, are binary and I cannot edit them...

References

[Bug 388926] [NEW] Lithuanian text recognition: wrong recognition of "ų" as an "ę"
From: Donatas Glodenis, 2009-06-18