cuneiform team mailing list archive

Re: Once again about hOCR

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: Marcin Miłkowski <milek_pl@xxxxx>
Date: Sat, 02 Jan 2010 00:25:03 +0100
In-reply-to: <20091119235343.368cfe7d.anagnost@yandex.ru>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.1.5) Gecko/20091204 Thunderbird/3.0 ThunderBrowse/3.2.6.8

Hi,

On 2009-11-19 21:53, Alexey Kryukov wrote:

Hi,

I am new to this list and I am interested in using hOCR in order to
generate a hidden text layer in DjVu books.

I see hOCR support has been greatly improved after the latest release.
However, there are still some glitches. First of all, the format
currently used for x_bboxes data looks a bit strange: cuneiform first
writes a text line and then an empty <span> element with character bbox
info, i.e.:

<span class='ocr_line'...>Some text<span class='ocr_cinfo'...></span></span>

I may be wrong here, but, according to my understanding of the spec, this
<span> should rather enclose the corresponding text, i.e.:

<span class='ocr_line'...><span class='ocr_cinfo'...>Some text</span></span>

I am not sure writing a parser for the currently produced hOCR would make any
sense, as such a parser probably would be incompatible with the output of
other hOCR-capable engines. Can anybody comment on this issue?

Well, I can see that ocrodjvu, which includes djvu2hocr, creates the tags the way you describe. It's probably a slight glitch in the cuneiform code.
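
Until that is fixed, the output could be normalized after the fact. Below is a minimal Python sketch, not anything cuneiform or ocrodjvu ships: it assumes the markup looks exactly like the snippets quoted above (single-quoted attributes, class as the first attribute, plain text directly inside ocr_line followed by an empty ocr_cinfo span). The function name wrap_text_in_cinfo is made up for illustration, and a real HTML parser would be more robust than a regular expression.

```python
import re

# Assumed input shape (taken from the snippets in this thread):
#   <span class='ocr_line'...>TEXT<span class='ocr_cinfo'...></span></span>
_TRAILING_CINFO = re.compile(
    r"(<span class='ocr_line'[^>]*>)"          # 1: opening ocr_line tag
    r"([^<]*)"                                 # 2: the line text
    r"(<span class='ocr_cinfo'[^>]*>)</span>"  # 3: opening tag of the empty ocr_cinfo span
)


def wrap_text_in_cinfo(hocr: str) -> str:
    """Move the line text inside the ocr_cinfo span that follows it."""
    # Reorder to: ocr_line open, ocr_cinfo open, text, ocr_cinfo close.
    # The closing tag of ocr_line is left untouched after the match.
    return _TRAILING_CINFO.sub(r"\1\3\2</span>", hocr)


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 0 0 10 10'>Some text"
        "<span class='ocr_cinfo' title='x_bboxes 1 2 3 4'></span></span>"
    )
    print(wrap_text_in_cinfo(sample))
```

Lines that already have the text inside ocr_cinfo do not match the pattern and are passed through unchanged.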

Anyway, I compiled the current version a couple of days ago and found that the image value for title is always "none.txt", which is plainly incorrect:

<div class='ocr_page' id='page_1' title='image "none.txt"; bbox 0 0 2816 2112'>

I'm using a sed script to correct it for my purposes (I run it in a loop, so I always know what image has been passed to cuneiform), but it should be fixed in the code as well, I guess. Or maybe I'm missing something?
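
For anyone who wants the same workaround without sed, here is a rough Python equivalent. It is only a sketch of the idea and not the actual script: the function name fix_image_name is hypothetical, and it assumes the ocr_page element is written exactly as in the line quoted above (single-quoted attributes, class before title).

```python
import re

# Assumed input shape (from the ocr_page line quoted in this message):
#   <div class='ocr_page' ... title='image "none.txt"; bbox ...'>
_PAGE_IMAGE = re.compile(
    r"""(class='ocr_page'[^>]*title='image ")none\.txt(")"""
)


def fix_image_name(hocr: str, image_name: str) -> str:
    """Replace the placeholder "none.txt" with the real image file name."""
    # A function is used as the replacement so that backslashes in
    # image_name (e.g. Windows paths) are inserted literally.
    return _PAGE_IMAGE.sub(
        lambda m: m.group(1) + image_name + m.group(2), hocr
    )


if __name__ == "__main__":
    page = "<div class='ocr_page' id='page_1' title='image \"none.txt\"; bbox 0 0 100 200'>"
    print(fix_image_name(page, "scan01.png"))
```

Only the image value in the ocr_page title is touched, so any other occurrence of none.txt in the recognized text stays as it is.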


Regards
Marcin

Follow ups

Re: Once again about hOCR
From: Yury V. Zaytsev, 2010-01-02

References

Once again about hOCR
From: Alexey Kryukov, 2009-11-19