cuneiform team mailing list archive

Re: PDF output

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: "Jussi Pakkanen" <jpakkane@xxxxxxxxx>
Date: Mon, 8 Sep 2008 11:35:22 +0300
In-reply-to: <84c9f6b0809060537r3fc32e14nbe830d1acdefb618@mail.gmail.com>

On Sat, Sep 6, 2008 at 3:37 PM, René Rebe <rene.rebe@xxxxxxxxx> wrote:

> Jussi: Do you have enough overview of the code to add bounding box
> information to the text lines of the HTML output? While the HTML
> output code is quite straight forward, I have not quickly found how to
> access the positional informations of the elements written out.

I asked about this on the Russian Cuneiform forum:

http://openocr.org/forum/viewtopic.php?f=7&t=2829

I got the following information via email from a person at Cognitive:

---8<----

Cuneform in low level doesn't have such things like "text paragraph",
but rfrmt library take blocks (text block, image block and so on),
text fragments (lines of text) and create rtf-like document
description. ced library is a container for document rtf-like
description and I think you can try extract paragraph border or set of
text lines for particular paragraph (and find paragraph border as a
cover rectangle for text lines rectangles)

---8<----
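
To illustrate the "cover rectangle" idea from the mail above, here is a
minimal sketch. It assumes the rectangles of a paragraph's text lines
have already been pulled out of the ced container somehow and are
available as (left, top, right, bottom) tuples. The function name and
the tuple layout are hypothetical and are not part of the ced or rfrmt
API.

```python
from typing import Iterable, Tuple

# Assumed layout: (left, top, right, bottom) in page pixels.
Rect = Tuple[int, int, int, int]


def cover_rect(line_rects: Iterable[Rect]) -> Rect:
    """Return the smallest rectangle enclosing all given line rectangles."""
    rects = list(line_rects)
    if not rects:
        raise ValueError("a paragraph needs at least one text line")
    lefts, tops, rights, bottoms = zip(*rects)
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Three made-up text lines belonging to one paragraph.
paragraph_lines = [(40, 100, 520, 122), (40, 126, 515, 148), (40, 152, 310, 174)]
print(cover_rect(paragraph_lines))  # (40, 100, 520, 174)
```

The paragraph border is then just the min/max over the line borders, so
no paragraph-level data is needed from the recognition core itself.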

> Are there any preferences where to add this hOCR related HTML
> annotation? Conditionalized into the existing html writer, or as a
> second copy of it adding those bounding boxes and possibly some post
> processing for the issues mentioned above?

I see no point in duplicating the HTML writer, as hOCR just adds some
simple tags. The only reason not to emit hOCR tags unconditionally is
that they can bloat the file: a span/bbox for every single character
quickly adds up.
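
As a rough sketch of what conditionalizing the writer could look like:
the function below emits a text line either as plain HTML or wrapped in
an hOCR span. The ocr_line class and the "bbox x0 y0 x1 y1" title
property come from hOCR; the function name, the hocr flag and the
trailing <br> are assumptions made for illustration and do not reflect
the actual Cuneiform HTML writer.

```python
from html import escape
from typing import Tuple

# Assumed layout: (left, top, right, bottom) in page pixels.
Rect = Tuple[int, int, int, int]


def format_line(text: str, bbox: Rect, hocr: bool = False) -> str:
    """Format one recognized text line, optionally with an hOCR bbox."""
    body = escape(text)
    if not hocr:
        # Plain output: same text, no positional annotation.
        return body + "<br>"
    x0, y0, x1, y1 = bbox
    return (
        f"<span class='ocr_line' title='bbox {x0} {y0} {x1} {y1}'>"
        f"{body}</span><br>"
    )


plain = format_line("Hello world", (10, 20, 300, 45))
annotated = format_line("Hello world", (10, 20, 300, 45), hocr=True)
print(plain)
print(annotated)
print(len(annotated) - len(plain), "extra bytes for this line")
```

The per-line overhead is a fixed few dozen bytes, which is why
annotating lines is cheap while annotating every character is not.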

Paragraph-sized bounding boxes would not bloat the file all that
much, but as mentioned above, they do not seem to be directly accessible.

References

PDF output
From: René Rebe, 2008-09-02
Re: PDF output
From: René Rebe, 2008-09-06