cuneiform team mailing list archive
Message #00337
[Bug 388926] Re: Lithuanian text recognition: wrong recognition of "ų" as an "ę"
Hey Ben, the patched file fixed it all. I ran OCR again on the page
behind my earlier attachment (yes, the attachment was the OCR'ed text in
a word processor with the mistakes highlighted, not the page to OCR), and
every instance of "ų" being wrongly recognized as "ę" is now corrected.
What hex editor did you use to modify the binary file? I tried KDE
Okteta, and I could *replace* symbols but not *add* new ones... Anyway,
there are quotation marks with each pair in the file rec9lit.dat; in some
cases there is only one pair, and in other cases a couple of pairs:
mrn""rnm"nnrm""dcl""cld"ce"ec"li"
How do I know if single or double quotation marks apply?
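One way to see where the single and doubled quotation marks fall is to split the sample on the quotation mark and look for empty fields. This is only a sketch on the string quoted above; it assumes nothing about the real layout of rec9lit.dat, and the names `sample` and `fields` are made up for illustration.

```python
# The sample quoted above, copied as a Python string.
sample = 'mrn""rnm"nnrm""dcl""cld"ce"ec"li"'

# Splitting on the quotation mark turns every doubled mark ("")
# into an empty field; the trailing mark also leaves one at the end.
fields = sample.split('"')

for index, field in enumerate(fields):
    label = field if field else "<empty: doubled or trailing quotation mark>"
    print(index, label)
```

The empty fields only show where two quotation marks touch; what that means to Cuneiform is still the open question.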
A couple more questions: are the sources for the Lithuanian dictionary
available anywhere? Or could someone convert it to a text format? I have
negotiated a 300 000-word dictionary with one institution in Lithuania
to be used with Tesseract OCR, and I think I could do the same for
Cuneiform (that dictionary would be free to use, but not open source,
and distributed only in binary format). This dictionary would cover more
than 80% of all words occurring in Lithuanian texts... I could try to
experiment with it on Cuneiform and report the results.
Another note: the cp1257 encoding (you guessed it correctly) is the
Microsoft default for Lithuanian Windows, but it is not even an ISO
standard. Could we perhaps use UTF-8 encoding instead?
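For what it is worth, converting between the two encodings is mechanical. A minimal sketch in Python, assuming its built-in cp1257 codec; the helper name `cp1257_to_utf8` is made up for illustration, and this says nothing about how Cuneiform handles encodings internally.

```python
# Both letters fit in one byte in cp1257 and take two bytes in UTF-8.
for letter in ("ų", "ę"):
    legacy = letter.encode("cp1257")
    utf8 = letter.encode("utf-8")
    print(letter, legacy.hex(), utf8.hex())


def cp1257_to_utf8(data: bytes) -> bytes:
    """Re-encode cp1257 bytes (for example, OCR output) as UTF-8."""
    return data.decode("cp1257").encode("utf-8")
```

Piping OCR output through something like this would be a workaround only; it would not change the encoding used inside the .dat files.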
Thank you, Ben, for taking an interest in this.
--
Lithuanian text recognition: wrong recognition of "ų" as an "ę"
https://bugs.launchpad.net/bugs/388926
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.
Status in Linux port of Cuneiform: In Progress
Bug description:
Using cuneiform 0.7 on Ubuntu 9.04
When OCR-ing a Lithuanian text with the switch "-l lit", a large number of the letters "ų" that usually come at the end of a word get recognized as "ę".
If someone pointed me to the source file I have to check, I am pretty certain that the solution is simple, as the mistake is very simple. However, I cannot find the file: the closest match, datafiles/*lit.dat, is binary and I cannot edit those files...