cuneiform team mailing list archive

Re: PDF output

 

Hi again,

I found out that using ocropus as the hOCR reference output is of
little value: we need a bounding box at least per line to position the
text reasonably accurately in the PDF behind the image layer, and as
far as I have seen, ocropus currently outputs only a bounding box for
the whole body. That is not enough to place the individual glyphs
roughly correctly behind the image.
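
To illustrate why a per-line bounding box is needed, here is a minimal
Python sketch of how such a box could be mapped to a position on the
PDF page. It assumes the box is given in the hOCR style
"bbox x0 y0 x1 y1" in pixels with the origin at the top left, and that
the scan resolution (dpi) is known. The function names are made up for
this example and are not part of ocropus or cuneiform.

```python
import re


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR-style title such as 'bbox 10 20 300 45'."""
    match = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if match is None:
        raise ValueError("no bbox in title: %r" % title)
    return tuple(int(value) for value in match.groups())


def bbox_to_pdf(bbox, page_height_px, dpi):
    """Convert a pixel bbox (origin top-left) to PDF points (origin bottom-left).

    Returns (left, bottom, width, height) in points, where 1 point = 1/72 inch.
    """
    x0, y0, x1, y1 = bbox

    def to_points(pixels):
        return pixels * 72.0 / dpi

    left = to_points(x0)
    # PDF y grows upwards, so the lower edge of the box is measured
    # from the bottom of the page.
    bottom = to_points(page_height_px - y1)
    width = to_points(x1 - x0)
    height = to_points(y1 - y0)
    return left, bottom, width, height
```

With only a single box for the whole body, every line would map to the
same rectangle, so there would be nothing to tell the lines apart.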

So I have to skip the first test with ocropus and go straight back to cuneiform.

Jussi: Do you have enough of an overview of the code to add bounding
box information to the text lines of the HTML output? While the HTML
output code is quite straightforward, I have not yet found how to
access the positional information of the elements written out.

Ideally, we should also generate span tags for each line of text to
have a chance to add the bounding box to each line.
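
As a rough sketch of what such a per-line span could look like, the
following Python snippet writes one line of text in the hOCR
convention, with class "ocr_line" and the box in the title attribute.
This is only an illustration of the target markup; the function name
is made up, and it does not reflect how the cuneiform HTML output code
is actually structured.

```python
from html import escape


def hocr_line(text, bbox):
    """Wrap one recognized text line in an hOCR-style span carrying its bbox.

    bbox is (x0, y0, x1, y1) in image pixels, origin at the top left.
    """
    x0, y0, x1, y1 = bbox
    return '<span class="ocr_line" title="bbox %d %d %d %d">%s</span>' % (
        x0, y0, x1, y1, escape(text))


if __name__ == "__main__":
    print(hocr_line("Hello world", (120, 85, 980, 130)))
```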

I'll now try to probe the structures from within a debugger to
hopefully get a better visualization of the in-memory document
structure and find the bounding box information.

PS: Something does not yet work quite right for me with the cuneiform
launchpad mailing list. Although I became a team member and subscribed
to the list, I apparently do not receive copies of the messages, which
makes following the development or replying a little more difficult
than it should be ...

René


