cuneiform team mailing list archive

Re: PDF output

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: "René Rebe" <rene.rebe@xxxxxxxxx>
Date: Sat, 6 Sep 2008 14:37:50 +0200
In-reply-to: <84c9f6b0809020422x33f0f5a9l66feb6fff78bfff1@mail.gmail.com>

Hi again,

I found out that using ocropus as a reference for hOCR output is not
of much value: we need a bounding box at least per line to position
the text reasonably accurately behind the image layer in the PDF, and
as far as I have seen ocropus currently only outputs a bounding box
for the whole body.
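
To make the positioning problem concrete, here is a minimal sketch in
Python of how a per-line pixel bounding box could be mapped to PDF
coordinates. It assumes the box is given as (x0, y0, x1, y1) in image
pixels with the origin at the top left, and that the scan resolution
in DPI is known. The function name bbox_to_pdf_points is made up for
illustration and is not part of cuneiform or ocropus.

```python
def bbox_to_pdf_points(bbox, image_height_px, dpi):
    """Map a pixel bbox (origin top-left) to PDF points (origin bottom-left).

    Returns (left, bottom, width, height) in PDF default user space,
    where one unit is 1/72 inch.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # PDF points per image pixel

    left = x0 * scale
    # Image y grows downwards, PDF y grows upwards, so flip the axis.
    bottom = (image_height_px - y1) * scale
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    return left, bottom, width, height


if __name__ == "__main__":
    # Hypothetical text line on a 300 dpi scan that is 3300 pixels tall.
    print(bbox_to_pdf_points((300, 600, 900, 660), 3300, 300))
```

With only a single box for the whole body there is nothing to feed
into such a mapping per line, which is why the per-line boxes matter.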

So I will skip the first test with ocropus and go straight back to cuneiform.

Jussi: Do you have enough overview of the code to add bounding box
information to the text lines of the HTML output? While the HTML
output code is quite straightforward, I have not yet found how to
access the positional information of the elements written out.

Ideally, we should also generate span tags for each line of text to
have a chance to add the bounding box to each line.
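
For illustration, a per-line span carrying its bounding box could look
like the output of this small Python sketch. It follows the hOCR
convention of an ocr_line class with a bbox entry in the title
attribute. The helper hocr_line is hypothetical and not cuneiform
code; the real change would have to live in the HTML output code.

```python
from html import escape


def hocr_line(text, bbox):
    """Wrap one line of recognized text in an hOCR-style span.

    bbox is (x0, y0, x1, y1) in image pixels, origin at the top left.
    """
    x0, y0, x1, y1 = bbox
    return "<span class='ocr_line' title='bbox {} {} {} {}'>{}</span>".format(
        x0, y0, x1, y1, escape(text)  # escape &, <, > and quotes in the text
    )


if __name__ == "__main__":
    # Made-up line text and coordinates, only to show the output shape.
    print(hocr_line("Hello world", (10, 20, 300, 48)))
```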

I'll now try to probe the structures from within a debugger to
hopefully get a better visualization of the in-memory document
structure and find the bounding box information.

PS: Something does not yet work quite right for me with the cuneiform
Launchpad mailing list: though I became a team member and subscribed
to the list, I apparently do not receive copies of the messages, which
makes following the development or replying a little more difficult
than it should be ...

René

Follow ups

Re: PDF output
From: Jussi Pakkanen, 2008-09-08

References

PDF output
From: René Rebe, 2008-09-02