cuneiform team mailing list archive

[Bug 623438] Re: Font size not correct in merged sandvich PDF

 

I have discussed this with someone who is an expert in PDF. My current
understanding is that the hidden text layer behind the displayed image
needs font size, spacing and similar information in order to be handled
correctly by the viewer.

I noticed that it is not only the selection in the viewer that does not
work correctly: many words are also not found by the viewers' internal
search (tested with Evince and Adobe Acrobat Reader).

Side note: If I extract the full text using a PDF library, I get
correct-looking text (words separated by a single space, no spaces
inside words).

I think that creating a correct sandwich PDF is crucial, and I wonder
why more people are not interested in this. But I also think that it is
not easy. It would probably take experts in OCR, PDF and fonts working
together to solve it. The key missing piece, IMHO, is a way to obtain
font metric information (font name, size, spacing, ...) when only the
bounding boxes and the text they contain are available. That is also why
I posted the link above, which I find important.
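
To make the missing piece a bit more concrete, here is a rough Python
sketch of how a font size and a horizontal stretch factor might be
guessed from nothing but an hOCR-style bounding box and its text. This
is not what the hOCR to PDF converter does; the function name, the 300
dpi default, the 0.5 em average glyph width and the idea of treating the
box height as the font size are all my own assumptions for illustration.

```python
def estimate_text_metrics(bbox, text, dpi=300, avg_glyph_width_em=0.5):
    """Guess font size (pt) and horizontal scaling (%) for one text box.

    bbox is (x0, y0, x1, y1) in image pixels, as in an hOCR "bbox" property.
    dpi and avg_glyph_width_em are assumptions, not values read from the scan
    or from a real font.
    """
    x0, y0, x1, y1 = bbox
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box must have positive width and height")
    if not text:
        raise ValueError("text must not be empty")

    # One PDF point is 1/72 inch, so pixels convert to points via the dpi.
    height_pt = (y1 - y0) * 72.0 / dpi
    width_pt = (x1 - x0) * 72.0 / dpi

    # Crude assumption: the box height is taken as the font size.
    font_size_pt = height_pt

    # Width the text would have at that size with an "average" glyph width.
    natural_width_pt = len(text) * avg_glyph_width_em * font_size_pt

    # Percentage by which the text must be stretched to fill the box.
    horizontal_scale = 100.0 * width_pt / natural_width_pt
    return font_size_pt, horizontal_scale
```

For a 600 x 50 pixel box at 300 dpi containing "hello world!", this
yields a 12 pt font stretched to 200 %. With real font metrics the
average glyph width would be replaced by the summed advance widths of
the actual characters, which is exactly the information that is missing
here.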

-- 
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: Invalid

Bug description:
After processing with Cuneiform for Linux 1.0.0 and the hOCR to PDF converter, version 0.7.4 (which should be the most current version), I get a sandwich PDF that looks fine until I select text.

See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
The effect is shown in screen087.png

For another file (Test10pages.pdf) the effect is even worse: I basically cannot select text to copy any more, because I can only guess where to move the mouse.

It looks like the font size in the HTML is somehow not correct. I am not an expert, but this link might help you:
http://www.emdpi.com/fontsize.html




