cuneiform team mailing list archive

[Bug 623438] Re: Font size not correct in merged sandvich PDF

 

I have been in touch with the developer. He has a lot to do, but I
sent a donation and he looked at the issue (we exchanged a few emails).
Here is his latest response:

On Mon, Sep 13, 2010 at 10:28, Rene Rebe <rene@xxxxxxxxxxxx> wrote:

Dear Martin,

the problem is that the latest cuneiform version completely changed the
way the bounding box information is written, in a way that makes no
sense to me. Before, each glyph had a bounding box, which is exactly
what we need to write a proper PDF. Now there is a bounding box per line
(which we do not need at all) and then an additional array of x start
positions. However, this can easily get out of sync with regard to
multi-byte UTF-8 sequences, and also with regard to whitespace. It would
also be particularly ugly to adapt the hocr2pdf HTML parser to cope with
these x position spans written out after the actual text. I doubt this
is valid hOCR, and even if it is, it makes no sense to first write out
the <span> with the text, and then another <span> just for the x
coordinates. And for proper font size estimation we need the real
y-height of the individual glyphs in any case (information not present
in the new format).

I suggest reverting the changes that mangled the hOCR annotation in
cuneiform, ... That would be approximately these revisions:

revno: 415
committer: julien <julien@xxxxxxxxxxxxxxxxxxx>
branch nick: cuneiform-linux
timestamp: Wed 2009-10-07 10:10:13 +0200
message:
 moved some tags around, now follows html spec and hocr spec. fixed russian comments that were destroyed during encoding
------------------------------------------------------------
revno: 414
committer: julien <julien@xxxxxxxxxxxxxxxxxxx>
branch nick: cuneiform-linux
timestamp: Fri 2009-10-02 21:48:45 +0200
message:
 separated ocr_line and character bboxes. now follows the hocr standard using the ocr_cinfo tag for char bboxes
------------------------------------------------------------
revno: 413
author: Dmitry Polevoy
committer: julien <julien@xxxxxxxxxxxxxxxxxxx>
branch nick: cuneiform-linux
timestamp: Thu 2009-10-01 17:07:51 +0200
message:
 hocr format now supports ocr_line. Replaced cuneiform_src/Kern/rout/src/html.cpp to the patch submitted in the cuneiform mailing list the 24th of February by Dmitry Polevoy. Changed %d to %l in a few sprintf statements in html.cpp
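
To illustrate the sync problem described in the reply above, here is a
minimal Python sketch. It is not taken from cuneiform or hocr2pdf: the
function names, the flat list of x start positions, and the
(x0, y0, x1, y1) box layout are assumptions made for illustration only.
It shows why a separate x position array can disagree with the text
(bytes versus characters, or skipped whitespace), and why a font size
estimate needs per-glyph heights.

```python
def pair_chars_with_x(text, x_starts):
    """Pair each character of a line with its x start position.

    Raises ValueError when the two sequences differ in length, which is
    the out-of-sync case.
    """
    chars = list(text)  # a Python str iterates by code point, not by byte
    if len(chars) != len(x_starts):
        raise ValueError(
            f"{len(chars)} characters but {len(x_starts)} x positions"
        )
    return list(zip(chars, x_starts))


def font_size_pt(glyph_bboxes, dpi):
    """Estimate a font size in points from per-glyph boxes (x0, y0, x1, y1).

    Uses the tallest glyph as a crude proxy; 1 point is 1/72 inch.
    """
    height_px = max(y1 - y0 for (_x0, y0, _x1, y1) in glyph_bboxes)
    return height_px * 72.0 / dpi


if __name__ == "__main__":
    word = "Größe"
    # Character count and UTF-8 byte count differ for non-ASCII text.
    print(len(word), len(word.encode("utf-8")))
    # One x position per byte (hypothetical values) cannot be paired
    # with the characters.
    try:
        pair_chars_with_x(word, [0, 10, 20, 20, 30, 30, 40])
    except ValueError as err:
        print("out of sync:", err)
```

With one box per glyph, as in the older output, neither problem arises:
each character carries its own position and height, so no second array
has to be kept aligned with the text.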

-- 
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: Invalid
Status in “exactimage” package in Ubuntu: New

Bug description:
After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter, version 0.7.4 (should be the most current version) I get a sandvich pdf that looks nice until I select text.

See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
The effect is shown in screen087.png

For another file (Test10pages.pdf) the effect is even worse: I basically cannot select text to copy any more, because I can only guess where to move the mouse.

It looks like the font size in the HTML is somehow not correct. I am not an expert, but this link might help you:
http://www.emdpi.com/fontsize.html




