cuneiform team mailing list archive

Checked in basic User Dictionary support

 

My branch ( lp:~ben.jackson/cuneiform-linux/working ) has basic support
for user dictionaries.

Create a wordlist in the right character set.  To find out which character
set your language uses, look at /usr/local/share/cuneiform/rec6XXX.dat,
where XXX is your language code (e.g. 'lit' for Lithuanian, which happens
to be in CP1257).  Then write the wordlist:

	$ vi wordlist.txt
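
If your editor works in UTF-8, one possible way to get the list into the
target character set is to write it as UTF-8 first and convert it with
iconv.  This is only a sketch: the file names and sample words are
hypothetical, and the one-word-per-line layout is an assumption, not
something the dictionary utility is known to require.  CP1257 is used
because that is the Lithuanian case mentioned above.

```sh
# Assumption: the wordlist is plain text with one word per line.
# Hypothetical file names; the sample words are Lithuanian.
printf 'ačiū\nžodis\n' > wordlist.utf8.txt

# Convert from UTF-8 to the character set of the target language (CP1257 here).
iconv -f UTF-8 -t CP1257 wordlist.utf8.txt > wordlist.txt
```

iconv stops with an error if a word contains a character that does not
exist in the target character set, which is a cheap check that the list
really fits the encoding.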

"Compile" this to a user dictionary using the new utility.  This must be
done as root (or at least with write permission to
/usr/local/share/cuneiform) because the path components are added
internally by the library:

	$ sudo cuneiform-dict -l <lang> -o <name>.voc wordlist.txt

You will now have '<name>.voc' in your data dir.  I just made up the
extension 'voc' based on the internal library functions voc_* (for
"vocabulary").

Now use it:

	$ cuneiform -l <lang> --dictionary <name>.voc ...

(you can specify --dictionary more than once)

Here's what to expect:  If cuneiform sees a word it's not sure about, your
user dictionary can help it pick from the options it sees.  So if it OCRs
something as "example.corn" (because "corn" is an English word), you can
create a user dictionary with "com" in it and try again, and it will
probably turn into "example.com".

This ties in directly with the work I did for the Lithuanian word ending
bug.  The 'rec9*.dat' files tell cuneiform about letters that are often
confused.  To fix a recognition problem you may need BOTH a new rec9 entry
(to cause cuneiform to try more variations on a word) AND a user dictionary
(to give it some way to validate the newly generated options).

-- 
Ben Jackson AD7GD
<ben@xxxxxxx>
http://www.ben.com/


