
cuneiform team mailing list archive

[Bug 585418] Re: can produce hOCR with illegal UTF-8 sequences

 

Since I am not familiar with Cyrillic (which is presumably what you get,
because you are using ruseng), could you please specify:

- which recognized character is the issue
- what UTF-8 sequence it produces
- what is the correct UTF-8 sequence for that character

-- 
can produce hOCR with illegal UTF-8 sequences
https://bugs.launchpad.net/bugs/585418
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: New

Bug description:
Cuneiform can produce hOCR that contains illegal UTF-8 sequences:

$ cuneiform -l ruseng -f hocr -o test.html test.png 
Cuneiform for Linux 0.9.0

$ grep -i utf-8 test.html 
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

$ iconv -f UTF-8 -t UTF-8 < test.html > /dev/null
iconv: illegal input sequence at position 401
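
The iconv check above reports only a position. To see which bytes are
actually invalid, something like the following Python sketch could be used.
It is not part of Cuneiform; the function names find_invalid_utf8 and
report_file are made up for illustration, and it assumes the file is
small enough to read into memory.

```python
from typing import Optional, Tuple


def find_invalid_utf8(data: bytes, context: int = 8) -> Optional[Tuple[int, bytes, bytes]]:
    """Return (offset, offending bytes, surrounding bytes) for the first
    invalid UTF-8 sequence in data, or None if data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the bytes the decoder rejected.
        lo = max(err.start - context, 0)
        return err.start, data[err.start:err.end], data[lo:err.end + context]
    return None


def report_file(path: str) -> None:
    """Print the first invalid sequence found in the file at path (hypothetical helper)."""
    with open(path, "rb") as fh:
        result = find_invalid_utf8(fh.read())
    if result is None:
        print("valid UTF-8")
    else:
        offset, bad, around = result
        print(f"offset {offset}: bad bytes {bad.hex(' ')}, context {around.hex(' ')}")
```

Calling report_file("test.html") on the output from the bug description
should print the offending bytes in hex, which would answer the second
question above. The offset it reports may not match the iconv position
exactly, since the two tools can count differently.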




