cuneiform team mailing list archive
Message #00046
Re: PDF output
Hi again,
I found out that using ocropus as the hOCR reference output is not
of much value: we need a bounding box at least per line to position
the text reasonably accurately in the PDF behind the image layer, and
as far as I have seen, ocropus currently only outputs a bounding box
for the whole body. That is not enough to place the individual glyphs
roughly correctly behind the image.
So I have to skip the first test with ocropus and go straight back to cuneiform.
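To illustrate why a per-line box matters, here is a minimal Python sketch of the mapping from an OCR pixel bounding box to a rectangle in PDF user space. It assumes the OCR box is given as (x0, y0, x1, y1) in pixels with the origin at the top-left, that the scan resolution is known, and that the PDF page uses the default 72 units per inch with the origin at the bottom-left. The function name and parameters are hypothetical and not part of cuneiform or ocropus.

```python
def bbox_to_pdf_rect(bbox, page_height_px, dpi):
    """Map a pixel bbox (x0, y0, x1, y1), origin top-left,
    to (left, bottom, width, height) in PDF points, origin bottom-left.

    Hypothetical helper for illustration only.
    """
    x0, y0, x1, y1 = bbox
    left = x0 * 72.0 / dpi
    # Flip the y axis: the lower edge of the box (y1) is measured
    # from the top of the image, PDF measures from the bottom.
    bottom = (page_height_px - y1) * 72.0 / dpi
    width = (x1 - x0) * 72.0 / dpi
    height = (y1 - y0) * 72.0 / dpi
    return left, bottom, width, height
```

With only one box for the whole body, this mapping yields a single page-sized rectangle, which says nothing about where each line of text should go.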
Jussi: Do you have enough of an overview of the code to add bounding box
information to the text lines of the HTML output? While the HTML
output code is quite straightforward, I could not quickly find how to
access the positional information of the elements being written out.
Ideally, we should also generate span tags for each line of text to
have a chance to add the bounding box to each line.
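As a rough sketch of what such a per-line span could look like, the Python snippet below formats one line in the hOCR convention, where the class is `ocr_line` and the box is carried in the `title` attribute as `bbox x0 y0 x1 y1`. The function name is hypothetical, and this is not how the cuneiform HTML writer is implemented; it only shows the target markup.

```python
from html import escape


def hocr_line(text, bbox):
    """Return an hOCR-style span for one text line.

    bbox is (x0, y0, x1, y1) in image pixels. Hypothetical helper,
    for illustration only.
    """
    x0, y0, x1, y1 = bbox
    return '<span class="ocr_line" title="bbox {} {} {} {}">{}</span>'.format(
        x0, y0, x1, y1, escape(text)
    )


print(hocr_line("Hello world", (10, 20, 300, 48)))
```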
I'll now try to probe the structures from within a debugger to
hopefully get a better visualization of the in-memory document
structure and find the bounding box information.
PS: Something does not yet work quite right for me with the cuneiform
launchpad mailing list. Although I became a team member and subscribed
to the list, I apparently do not receive copies of the messages, which
makes following the development or replying a little more difficult
than it should be ...
René