cuneiform team mailing list archive

Once again about hOCR

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: Alexey Kryukov <anagnost@xxxxxxxxx>
Date: Thu, 19 Nov 2009 23:53:43 +0300
Organization: Moscow State University

Hi,

I am new to this list and I am interested in using hOCR to generate a
hidden text layer in DjVu books.

I see that hOCR support has been greatly improved since the latest
release. However, there are still some glitches. First of all, the format
currently used for x_bboxes data looks a bit strange: cuneiform first
writes the text of a line and then an empty <span> element with the
character bbox info, i.e.:

<span class='ocr_line'...>Some text<span class='ocr_cinfo'...></span></span>

I may be wrong here, but according to my understanding of the spec, this
<span> should rather enclose the corresponding text, i.e.:

<span class='ocr_line'...><span class='ocr_cinfo'...>Some text</span></span>

I am not sure writing a parser for the currently produced hOCR would make
any sense, as such a parser would probably be incompatible with the output
of other hOCR-capable engines. Can anybody comment on this issue?
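To make the compatibility concern concrete, here is a minimal sketch of a
parser that accepts both layouts shown above. It is only an illustration,
not what cuneiform or any other engine ships: the class and function names
are hypothetical, and it assumes the character boxes are stored in the
title attribute of the ocr_cinfo span as "x_bboxes x0 y0 x1 y1 ...", which
the elided attributes above do not confirm.

```python
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collect text and character boxes per ocr_line (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # class attribute of every currently open <span>
        self.lines = []   # one dict per ocr_line encountered

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        self.stack.append(cls)
        if cls == "ocr_line":
            self.lines.append({"text": "", "bboxes": []})
        elif cls == "ocr_cinfo" and self.lines:
            # Assumption: title="x_bboxes x0 y0 x1 y1 x0 y0 x1 y1 ..."
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "x_bboxes":
                    nums = [int(n) for n in parts[1:]]
                    self.lines[-1]["bboxes"] = [
                        tuple(nums[i:i + 4])
                        for i in range(0, len(nums) - 3, 4)
                    ]

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text counts whether it sits directly in ocr_line or inside ocr_cinfo.
        if "ocr_line" in self.stack and self.lines:
            self.lines[-1]["text"] += data


def parse_lines(hocr):
    """Return a list of {"text": ..., "bboxes": [...]} dicts, one per line."""
    parser = LineCollector()
    parser.feed(hocr)
    parser.close()
    return parser.lines
```

Because the text is collected from anywhere inside the ocr_line span, both
layouts yield the same result here. The price is that the parser has to be
written with the current cuneiform layout specifically in mind.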

Moreover, one final "</span>" per line is still written if html output
(i.e. no hOCR tags) is requested. So the generated html is essentially
invalid.
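A quick way to check for this is to count closing </span> tags that have
no matching opening tag. The sketch below is only a rough check using
Python's standard html.parser module; the names are hypothetical and the
sample strings in the usage note are made up, not actual cuneiform output.

```python
from html.parser import HTMLParser


class SpanBalance(HTMLParser):
    """Count </span> tags that close nothing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of currently open <span> elements
        self.stray = 0   # closing tags seen while no <span> was open

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag != "span":
            return
        if self.depth == 0:
            self.stray += 1
        else:
            self.depth -= 1


def count_stray_span_closes(html):
    """Return the number of unmatched closing </span> tags in html."""
    checker = SpanBalance()
    checker.feed(html)
    checker.close()
    return checker.stray
```

For a made-up fragment such as "<p>First line</span><br>Second
line</span></p>" this returns 2, while a properly nested
"<span>text</span>" returns 0.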

-- 
Regards,
Alexey Kryukov <anagnost at yandex dot ru>

Moscow State University
Historical Faculty

Follow ups

Re: Once again about hOCR
From: Marcin Miłkowski, 2010-01-01
Re: Once again about hOCR
From: Yury V. Zaytsev, 2009-11-19